ADA034600 


ESD-TR-76-112 


Technical  Report 

523 

Acoustic  Characteristics 
of  Stop  Consonants: 

A Controlled  Study 

V.  W.  Zue 

17  May  1976 


Prepared  for  the  Defense  Advanced  Research  Projects  Agency 
under Electronic Systems Division Contract F19628-76-C-0002 by

Lincoln  Laboratory 

MASSACHUSETTS  INSTITUTE  OF  TECHNOLOGY 
Lexington,  Massachusetts 


Approved  for  public  release;  distribution  unlimited. 




The work reported in this document was performed at Lincoln Laboratory,
a center for research operated by Massachusetts Institute of Technology.
This work was sponsored by the Defense Advanced Research Projects
Agency under Air Force Contract F19628-76-C-0002 (ARPA Order 2006).

This  report  may  be  reproduced  to  satisfy  needs  of  U.S.  Government  agencies. 


The  views  and  conclusions  contained  in  this  document  are  those  of  the 
contractor  and  should  not  be  interpreted  as  necessarily  representing  the 
official  policies,  either  expressed  or  implied,  of  the  Defense  Advanced 
Research  Projects  Agency  of  the  United  States  Government. 


This technical report has been reviewed and is approved for publication.
FOR  THE  COMMANDER 

Raymond  L.  Loiselle,  Lt.  Col.,  USAF 
Chief,  ESD  Lincoln  Laboratory  Project  Office 


Non-Lincoln  Recipients 

PLEASE  DO  NOT  RETURN 


Permission  is  given  to  destroy  this  document 
when  it  is  no  longer  needed. 


ACOUSTIC  CHARACTERISTICS  OF  STOP  CONSONANTS: 
A CONTROLLED  STUDY 


V. W. ZUE, Consultant
Group  24 


TECHNICAL  REPORT  523 
17  MAY  1976 




ABSTRACT 


The research described in this report has two distinct and integral parts. The first
part is directed toward the development of a highly interactive computer facility
where controlled studies of the acoustic characteristics of selected consonants,
consonant clusters, and vowels in a prescribed phonetic environment can be carried
out. In conjunction with the development of the data-base facility, a large
corpus of acoustic data has been collected. The format of the data is a nonsense
hə'CVC utterance embedded in a carrier sentence, "Say ___ again," where the
consonants and vowels are systematically varied. Fifteen vowels and diphthongs
were used to form the syllable nuclei, and 51 word-initial consonants and consonant
clusters were included.

The second half of the report utilizes the collected data and the developed facility
to study the acoustic characteristics of English stops, both in singleton and in
clusters. The data included 1728 utterances spoken by 3 male speakers. Various
aspects of the temporal and spectral characteristics of these stops were quantified
and discussed in detail. The findings in general suggest the presence of
context-independent acoustic properties for these stops. The exact nature of the acoustic
invariance, however, still remains a topic of further investigation.


* This report is based on a thesis of the same title submitted to the Department
of Electrical Engineering and Computer Science at the Massachusetts Institute
of Technology on 14 May 1976, in partial fulfillment of the requirements for the
degree of Doctor of Science.




CONTENTS 


Abstract
Acknowledgments

CHAPTER 1 - INTRODUCTION
1.1  Physiology and Acoustics of Speech Production
1.2  Linguistic Framework
1.3  Problem Areas
1.4  Literature Review
1.5  Overview

CHAPTER 2 - DEFINING THE RESEARCH GOAL AND THE DATA COLLECTION PROCESS
2.1  Scope of the Study
2.2  Data Format
2.3  Acquisition of Acoustic Data

CHAPTER 3 - ANALYSIS SYSTEM AND DATA-BASE FACILITY
3.1  Acoustic Analysis Procedure
3.1.1  Linear Prediction
3.1.2  Results Comparing Various Spectrum Analysis Techniques
3.2  Computer Systems
3.2.1  Univac-FDP System
3.2.2  TX-2 System
3.3  Procedures for Data Processing and Data-Base Facility
3.3.1  Processing on the Univac-FDP
3.3.2  Data-Base Facility

CHAPTER 4 - TEMPORAL CHARACTERISTICS OF ENGLISH STOPS
4.1  Measurements and Techniques
4.2  Summary of Data
4.3  Results
4.3.1  Singleton Stops
4.3.2  Stops in Clusters
4.4  Discussion

CHAPTER 5 - SPECTRAL CHARACTERISTICS OF ENGLISH STOPS
5.1  Measurements and Techniques
5.2  Results
5.2.1  Singleton Stops
5.2.2  Stops in Clusters
5.3  Discussion

CHAPTER 6 - CONCLUDING REMARKS

References




ACKNOWLEDGMENTS


I would like to express my sincere gratitude to Professor Kenneth Stevens
for his time and effort in supervising this report. During my past six years
at M.I.T. I have, on numerous occasions, benefited tremendously from his
constant guidance, his penetrating questions, his insightful remarks, and
his infinite patience. My association with him represents the high point of
my educational career. Besides thanking the readers for reading and offering
valuable comments on the manuscript, I would also like to thank each
one of them individually: Dennis Klatt for his constant tutorage in acoustic
phonetics; Alan Oppenheim for teaching me what digital signal processing
is all about; and Ben Gold for his friendship and his generosity in resources
and manpower.

Most of the research described in this report was conducted at the M.I.T.
Lincoln Laboratory. To all of those at Lincoln Laboratory who helped
me along the way, I acknowledge their assistance with appreciation.




ACOUSTIC  CHARACTERISTICS  OF  STOP  CONSONANTS 
A CONTROLLED  STUDY 


CHAPTER  1 
INTRODUCTION 

In the process of human communication by spoken language, the speech signal (i.e., the
acoustic waveform) plays a unique role. On one hand, it is the final result of the complex
encoding that transpires at various stages of the speech production process. At the receiving
end, however, the speech signal is the principal information carrier upon which the perceptual
decoding process must operate.

Because of the relative ease of access and manipulation of the speech signal, it has been
the focus of many past research efforts that seek to better understand the nature of language.
Although much has been learned about the acoustic events of speech, our understanding of the
relationship between the acoustic characteristics of speech sounds and their underlying
linguistic units still remains, for the most part, vague.

The purpose of this report is to probe further into this relationship for a subset of the
English speech sounds; namely, the stop consonants. The study is carried out under a controlled
linguistic environment, using a data-base facility designed for acoustic phonetic research.

Before reviewing past research on the subject and pointing out the problem areas, it is
appropriate to first provide a brief account of the physiology and the acoustics of speech
production, as well as to summarize the linguistic framework on which this research is based.

1.1  PHYSIOLOGY  AND  ACOUSTICS  OF  SPEECH  PRODUCTION 

Speech is generated by closely coordinated movements of several groups of human anatomical
structures. One such group of structures consists of those that enclose the air passage
below the larynx. Through control of the muscles and through forces generated by the elastic
recoil of the lungs, pressure can be built up below the larynx. This pressure eventually
provides energy for the speech signal.

Immediately above the trachea is the larynx, which constitutes the second group of
structures essential to the production of speech. The vocal cords in the larynx can be positioned in
many ways so that air can flow through the glottis either with or without setting the vocal cords
into vibration. When the vocal cords are set into vibration, the airflow through the glottis is
interrupted quasi-periodically, thus creating the effect of modulation.

The  third  set  of  structures  consists  of  the  tongue,  jaw,  lips,  velum,  and  other  components 
that  form  the  vocal  and  nasal  cavities.  By  changing  the  configuration  of  the  vocal  tract,  one 
can  shape  the  detailed  characteristics  of  the  speech  sounds  being  produced. 

It is convenient to describe the acoustics of speech production in terms of three distinct
stages. First, through interaction between airflow from the lungs and the laryngeal and
supraglottal structures, a source of acoustic energy is created. This acoustic source may be one of
several types, and may have several possible positions. The source acts as the excitation for
the cavities above and below it. The filtering that is imposed on the source by these cavities
is the second stage in the generation of speech sounds. Finally, sound is radiated from the
lips and/or the nostrils.
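
The three stages can be illustrated with a minimal numerical sketch. Everything in it is an assumption chosen for illustration rather than taken from this report: a 10-kHz sampling rate, a 100-Hz impulse train as the source, a single two-pole resonance at 500 Hz as the filter, and a first difference as a crude stand-in for radiation.

```python
import math

def impulse_train(n_samples, period):
    """Stage 1: idealized quasi-periodic source, one unit pulse every `period` samples."""
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]

def resonator(x, freq_hz, bandwidth_hz, fs_hz):
    """Stage 2: a single two-pole resonance standing in for the cavity filtering."""
    r = math.exp(-math.pi * bandwidth_hz / fs_hz)   # pole radius set by the bandwidth
    theta = 2.0 * math.pi * freq_hz / fs_hz         # pole angle set by the center frequency
    a1, a2 = 2.0 * r * math.cos(theta), -r * r
    y = []
    for n, xn in enumerate(x):
        y1 = y[n - 1] if n >= 1 else 0.0
        y2 = y[n - 2] if n >= 2 else 0.0
        y.append(xn + a1 * y1 + a2 * y2)
    return y

def radiate(y):
    """Stage 3: first difference as a rough model of radiation at the lips."""
    return [y[n] - (y[n - 1] if n >= 1 else 0.0) for n in range(len(y))]

FS = 10000                                   # assumed sampling rate, Hz
source = impulse_train(400, FS // 100)       # assumed 100-Hz fundamental
speech = radiate(resonator(source, 500.0, 60.0, FS))
```

A real vocal tract has several resonances and the source need not be periodic; the sketch only shows how the source, filter, and radiation stages compose.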




Fig.  1-2.  Spectrograms  of  the  words  "tea,"  "steep,"  and  "tree."  (Spectral  and  temporal 
characteristics  of  the  stop  release  are  modified  by  the  phonetic  environment.) 




1.2  LINGUISTIC  FRAMEWORK 


Studies of the way language is organized have produced overwhelming evidence that
underlying the production and perception of speech there exists a sequence of basic discrete segments
that are concatenated in time. These segments, called "phonemes," are assumed to have unique
articulatory and acoustic characteristics. It has been proposed by Jakobson, Fant, and Halle¹
that the phonemes can be characterized by a set of invariant attributes called distinctive
features. The distinctive features bear a direct relationship to the articulatory gesture from which
the speech sound is produced, and they have certain well-defined acoustic correlates.
Therefore, at the phonemic level, the linguistic structure of an utterance can be represented by a
two-dimensional matrix with columns representing the phonemes, rows listing the distinctive
features, and the matrix entries indicating the presence or absence of a feature for a given
phoneme.
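
A small sketch may make the matrix representation concrete. The feature names below (voiced, labial, alveolar, velar) are simplified, hypothetical labels chosen for readability; they are not the distinctive features of Jakobson, Fant, and Halle, and the six stops are used only as a convenient example.

```python
PHONEMES = ["p", "t", "k", "b", "d", "g"]
FEATURES = ["voiced", "labial", "alveolar", "velar"]   # hypothetical, simplified labels

# Rows are features, columns are phonemes; 1 marks presence, 0 absence.
MATRIX = [
    [0, 0, 0, 1, 1, 1],  # voiced
    [1, 0, 0, 1, 0, 0],  # labial
    [0, 1, 0, 0, 1, 0],  # alveolar
    [0, 0, 1, 0, 0, 1],  # velar
]

def features_of(phoneme):
    """Set of features marked present in the column for one phoneme."""
    col = PHONEMES.index(phoneme)
    return {FEATURES[row] for row in range(len(FEATURES)) if MATRIX[row][col]}

def utterance_matrix(phonemes):
    """Feature matrix of an utterance: one column per phoneme, in sequence."""
    cols = [PHONEMES.index(p) for p in phonemes]
    return [[MATRIX[row][c] for c in cols] for row in range(len(FEATURES))]
```

For instance, `features_of("d")` gives the two features marked in the /d/ column, and `utterance_matrix(["b", "k"])` gives a four-row, two-column matrix.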

It  should  be  noted  that  at  the  phonemic  level,  the  distinctive  feature  theory  necessitates  a 
discrete  (or  even  binary)  selection,  whereas  at  the  articulatory  and  acoustic  levels,  the  feature 
correlates  appear  to  take  on  a continuum  of  values. 

1.3  PROBLEM  AREAS 

During the production of speech, the linguistic contents of the feature matrix are
transformed into actual neuromuscular commands that set the articulators (lips, jaw, tongue, etc.)
into motion. Although the commands may be discrete, or stepwise, the actual motions of the
articulators and the resulting acoustic signals are continuous, due to the interaction among
various structures and their different degrees of sluggishness. The result is an overlap of phonemic
information from one segment to another. In other words, although the features have certain
well-defined acoustic correlates, there is hardly a one-to-one correspondence between a given
feature and its correlates. More precisely, the acoustic manifestation of a given feature
appears to depend on the presence or absence of other features. Furthermore, when phonemes
are concatenated to form an utterance, the acoustic correlates of the underlying features will
undergo modification and distortion as a consequence of the phonetic environments. For
example, Fig. 1-1 shows spectrograms of the two words "do" and "boo." Although the vowels in the
two words are phonemically identical, the acoustic characteristics of the vowels can be seen to
be quite different.

Similarly, one can clearly observe the differences (both in temporal and in spectral
characteristics) of the phoneme /t/ in three different words "tea," "steep," and "tree," as shown in
Fig. 1-2.

These  examples  illustrate  the  important  fact  that  in  any  study  of  the  acoustic  properties  of 
speech  sounds,  the  influence  of  the  phonetic  environment  must  be  taken  into  account. 

Another important feature of speech communication is that sometimes a speaker can distort
the acoustic properties of speech sounds so severely that even the environment will provide no
acoustic cues to the identity of the phoneme. Figure 1-3 provides examples of such acoustic
distortion. The schwa in the second syllable of the word "multiply" and the first syllable of the
word "display" can be devoiced such that it exhibits no acoustic characteristics commonly
associated with vowels. Such distortion is possible because a listener is capable of decoding an
utterance not only from the acoustic signal, but also from his familiarity with the syntactic and
semantic constraints, and with rules governing allowable phoneme sequences of his language.


Fig.  1-3.  Spectrogram  of  the  sentence  "multiply  the  numbers  and  display  the  result" 
(example  of  schwa  deletion  as  a consequence  of  phonological  effect). 

The  above  examples  serve  to  emphasize  the  fact  that  although  there  might  be  invariant 
acoustic  cues  for  a distinctive  feature,  the  surface  realization  of  such  an  acoustic  cue  is  very 
much  intermingled  with  other  cues.  Only  through  a very  careful  and  controlled  study  can  one 
satisfactorily  answer  the  question  of  the  acoustic  invariance,  if  any,  of  phonetic  features. 

Another important factor contributing to our lack of success in relating acoustic
characteristics to phonetic features is the variability from one speaker to another. The acoustic
properties of speech sounds depend on the physiological structure of the vocal apparatus, which
varies from speaker to speaker. Furthermore, given a single speaker, the same utterance
pronounced on two separate occasions could vary considerably. In any study of the acoustic
properties of speech sounds, these inter- and intraspeaker variations will have to be accounted
for. This requirement usually translates into multiple-speaker and multiple-session analysis,
which in turn suggests a large corpus of data.

1.4  LITERATURE  REVIEW 

The acoustic characteristics of stops and the effect of coarticulation have been studied by
many in the past, and a number of the studies date back some twenty years. Most of these
studies are directed toward the search for perceptually important acoustic cues, and each study
achieved a varying degree of success. Due to various technical difficulties involving the
processing and the storage of a large amount of data, most of these studies have been rather limited
in scope.

The pioneering work at Haskins Laboratory represents the early search for perceptually
important acoustic cues of stops (Cooper et al.,² Delattre et al.³). Using the pattern playback
machine that converts hand-painted spectrograms into sounds, the investigators were able to
vary independently the frequency location of the stop burst and the amount and direction of the
second formant transition into the following vowel. Their major finding was that burst
frequency and the direction and degree of formant transitions are perceptually important cues
to the identification of the stop consonants. These findings prompted the subsequent proposition
of the existence of acoustic loci for these consonants. Delattre et al. speculated on the
association of different formant loci with the place and manner of articulation of these consonants.


Concurrent with the Haskins studies, Fischer-Jorgensen⁴ reported a study of the acoustic
properties of the six Danish stops. Similar, though not identical, results were found with regard to the
formant loci of these consonants. The author noted the significant role played by the aspiration
which serves to differentiate the Danish /p,t,k/ from /b,d,g/. It was also found that vowel
environment tended to alter the acoustic properties of stops.

Spectral characteristics of English stop consonants were studied by Halle et al.⁵
Quantitative data were gathered and possible criteria for identification were proposed and tested. The
study asserts that burst and transitions are the two major cues for the identity of these
consonants, although the authors took issue with the "locus theory" proposed by the Haskins group.
It was felt that a set of more complex rules seemed to operate on the formant transitions.

Lisker and Abramson⁶ reported a cross-language study of voicing in initial stops. The
major finding of this study is that the features of voicing, aspiration, and force of articulation
could be plausible consequences of a single variable: voice-onset time (VOT). VOT was found
to be not only a basis for separating the voicing categories, but also sensitive to the place of
stop closure.

Subsequent studies by Lisker and Abramson⁷ extended the results to other phonetic
environments. The study by Klatt⁸ was the first reported investigation of the variation of VOT in
clusters.

In summary, a good deal of information with regard to the acoustic characteristics of stop
consonants has been published in the literature. Although each of these studies has contributed
individually to our understanding of the acoustic events of speech, the results are fragmentary
and lack continuity. They all suffer, in one way or another, from some of the problems stated
in the previous section.

1.5  OVERVIEW 

In Chap. 2, we define the research goal of this report, and describe the data collection
procedure. Chapter 3 is devoted to the description of the analysis system, as well as the data-base
facility. Chapter 4 presents results and discussion of the temporal characteristics of English
stops, both in isolation and in clusters. The spectral characteristics of these stops are
presented in Chap. 5.




CHAPTER  2 

DEFINING THE RESEARCH GOAL
AND  THE  DATA  COLLECTION  PROCESS 


One long-range goal of the research in acoustic phonetics is to determine the relevant
acoustic properties of all speech sounds and to relate these properties to the underlying features
that characterize the sounds. As was pointed out in the previous chapter, the nature of speech
as a communication medium makes this goal a formidable one. This report deals with a
problem of much reduced magnitude in that the number of speech sounds studied is rather limited,
and the phonetic environment and linguistic and phonological influences are carefully controlled.

This  chapter  first  describes  the  scope  of  the  study  in  detail.  Considerations  that  went  into 
the  design  of  the  data  format  are  then  presented,  followed  by  a description  of  the  acquisition  of 
acoustic  data. 

2.1  SCOPE  OF  THE  STUDY 

The study reported here has two distinct and integral parts. The first part involves the
collection of a large corpus of acoustic data, as well as the development of a highly interactive
computer facility where a substantial amount of acoustic data can be stored, examined, and
analyzed. The second part utilizes the facility to study a subset of the collected data, namely,
the English stops, both in singleton and in clusters, preceding stressed vowels.

It was decided at the outset that this study would be directed primarily toward prestressed
consonants and consonant clusters. Data collection and the design of the data format
consequently reflect such a constraint. We have restricted ourselves to the study of prestressed
consonants (and clusters) for several reasons. First, a stressed consonant-vowel (C-V) sequence
is universal among all known languages. Studies of prestressed consonants can provide a
common ground on the basis of which cross-language differences can be compared. Secondly,
stressed syllables in an utterance are probably articulated with greater care and effort, thereby
resulting in a robust acoustic signal from which parameters can be extracted more reliably.
Furthermore, based on some early studies (Stevens and Klatt,⁹ Stevens¹⁰), it may be hypothesized that
the intrinsic acoustic properties of consonants are least distorted by the environment when they
appear in stressed C-V syllables. Acoustic invariants are more likely to be observed in such
an environment. Therefore, this phonetic environment might provide, in some sense, the
clearest indication of the ideal relationship between the underlying features and their acoustic
correlates.

Although inter- and intraspeaker variabilities are not the major concerns of this study, the
data-base was to include a number of talkers, each recorded on several occasions. The
inclusion of several speakers and sessions presumably minimizes the probability of an individual
speaker or recording session introducing a bias in the results.

In conjunction with the collection of acoustic data, an acoustic analysis system was
developed to enable the preliminary processing of acoustic data. Digitization of the speech
waveform, computation of the short-time spectra, and the hand-marking of utterances with phonetic
labels can all be done conveniently. The facility was also developed such that utterances were
stored in a data-base where retrieval and analysis could be done with ease. Various aspects
of the analysis system and data-base will be discussed in great detail in Chap. 3.




Using the facility developed, the acoustic characteristics of prestressed English plosives,
both in singleton and in clusters, have been studied in detail. Durational and spectral
characteristics of the plosives were examined as a function of the phonetic environments, and possible
interpretations of the data in connection with production and perception were suggested. These
results are presented in Chaps. 4 and 5.

2.2  DATA FORMAT

Several considerations have been weighed in the design of the data format. We have thus
far limited ourselves to stressed C-V syllables where the consonant (or consonant cluster) and
vowel are varied in some systematic way. Experience has shown that stressed C-V syllables
in isolation can be articulated in a rather unnatural manner. It is, therefore, advantageous to
frame the C-V syllable in a carrier sentence so as to simulate a more natural,
continuous-speech-like environment. Since certain English vowels do not appear in syllable-final positions,
a final consonant was added to the C-V syllable. Finally, we would like to eliminate as much
as possible the linguistic and phonological influences. The last criterion is important because
a speaker can sometimes distort the acoustic properties of a phonetic segment so severely that
even the environment will provide no cues to the identity of the segment. Such a distortion is
possible because a listener is capable of decoding such an utterance not only from the acoustic
signal, but also from his familiarity with the syntactic and semantic constraints, and with the
rules governing allowable phoneme sequences of his language. In order to study the acoustic
characteristics of speech sounds, it is certainly desirable to minimize such higher-level
influences.

The format of the data was finally decided to be a nonsense word hə'CVC embedded in a
carrier sentence, "Say ___ again." The prestressed consonants and consonant clusters, the
vowels, and the poststressed consonants that were included in our data collection are listed in
Table 2-I. They included the 15 vowels and diphthongs in English and essentially all allowable
word-initial consonants and consonant clusters in English. The hə'CVC nonsense word format,
incidentally, had been used previously by other researchers (House and Fairbanks,¹¹ Stevens
and House,¹² Stevens et al.¹³) to study the acoustic properties of vowels and consonants.
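
As a rough illustration of how such a systematically varied list could be generated, the sketch below builds the carrier sentences from lists of onsets and vowels. The function names, the plain-text transcription, and the default final consonant /t/ are assumptions made for this example; the report does not describe any such program.

```python
def nonsense_word(onset, vowel, coda="t"):
    """Build a hə'CVC nonsense word; the default coda /t/ is an assumption."""
    return "hə'" + onset + vowel + coda

def carrier_sentence(word):
    """Embed the nonsense word in the carrier sentence."""
    return "Say " + word + " again."

def utterance_list(onsets, vowels, coda="t"):
    """Vary the vowel fastest, so each onset is paired with every vowel in turn."""
    return [carrier_sentence(nonsense_word(c, v, coda)) for c in onsets for v in vowels]

sentences = utterance_list(["pl", "pr"], ["i", "ay"])
```

Crossing every onset with every vowel in this way is what makes the corpus size grow as the product of the two inventories.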

2.3  ACQUISITION  OF  ACOUSTIC  DATA 

All the acoustic data were recorded in a soundproof room where the signal-to-noise ratio
is above 50 dB. An Altec 084R microphone was used in conjunction with a Presto model 800
tape recorder for the recording. The microphone was suspended from the ceiling and was
placed approximately 2 inches above the subject's upper lip and 10 to 12 inches in front of the
subject. After being seated in front of the microphone, the subject was asked to frame each
utterance in the carrier sentence and read it aloud. A sample of the list of utterances is included
in Table 2-II. Prior to the actual recording, the subject was asked to read aloud for
approximately one minute so as to allow adjustment of recording level and reading speed; speaking rate
was roughly maintained at 5 syllables per second within each sentence-like utterance. The
subject was asked to speed up or slow down without explicit knowledge of the fact that speaking
rate was being controlled. During the recording session, the person operating the recorder
monitored the utterances through a set of headphones. Communication between him and the
subject was achieved via signs displayed on the large window between the soundproof room and
the room where the recorder was situated. The subject was asked to pause after each list, at


TABLE 2-I

LIST OF ALL THE CONSONANTS,
CLUSTERS, AND VOWELS USED

Singleton      2-Element      3-Element
Consonant      Clusters       Clusters      Vowel

p              pl    θr       spr           i
t              pr    šr       spl           ɪ
k              tr             str           e
b              tw             skr           ɛ
d              kl                           æ
g              kr                           a
m              kw                           ʌ
n              bl                           ɔ
s              br                           o
[illegible]    dr                           ʊ
[illegible]    dw                           u
[illegible]    gl                           ɝ
z              gr                           ay
[illegible]    gw                           ɔy
v              fl                           aw
[illegible]    fr
l              sp
r              st
w              sk
y              sm
               sn
               sl
               sw

TABLE 2-II

SAMPLE UTTERANCE LIST

2210  hə'yʊt      2301  hə'plit      2401  hə'prit
2211  hə'yut      2302  hə'plɪt      2402  hə'prɪt
2212  hə'yɝt      2303  hə'plet      2403  hə'pret
2213  hə'yayt     2304  hə'plɛt      2404  hə'prɛt
2214  hə'yɔyt     2305  hə'plæt      2405  hə'præt
2215  hə'yawt     2306  hə'plat      2406  hə'prat
                  2307  hə'plʌt      2407  hə'prʌt
                  2308  hə'plɔt      2408  hə'prɔt
                  2309  hə'plot      2409  hə'prot
                  2310  hə'plʊt      2410  hə'prʊt
                  2311  hə'plut      2411  hə'prut
                  2312  hə'plɝt      2412  hə'prɝt
                  2313  hə'playt     2413  hə'prayt
                  2314  hə'plɔyt
                  2315  hə'plawt


which time he was asked to repeat those utterances that were erroneous. An entire recording
session lasted approximately 45 minutes, marked by one or two rest periods. The subject
preceded each utterance with a preassigned identification number as shown in Table 2-II. This
number contains complete information about the phonetic context. For example, the number 2411
can be decoded to be a cluster /pr/, to be followed by the vowel /ʌ/. Therefore, at the time
when phonetic context is actually entered into the data-base, the user needs only to refer to
this number, which the computer will decode automatically. This method eliminates the
confusing and sometimes difficult task of deciding the phonetic context from human perception of the
utterance.
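
The decoding step might look something like the sketch below. It assumes, without confirmation from the report, that the first two digits of the number index the consonant or cluster and the last two index the vowel. The lookup tables are hypothetical and deliberately incomplete: they hold only the few codes that can be read with reasonable confidence from the sample list, and any other number is rejected rather than guessed.

```python
# Hypothetical, partial code tables (assumed, not taken from the report's software).
ONSET_CODES = {23: "pl", 24: "pr"}
VOWEL_CODES = {13: "ay", 14: "oy", 15: "aw"}

def decode(utterance_id):
    """Split a four-digit identification number into (onset, vowel) labels."""
    onset_code, vowel_code = divmod(utterance_id, 100)
    if onset_code not in ONSET_CODES:
        raise KeyError("unknown onset code: %d" % onset_code)
    if vowel_code not in VOWEL_CODES:
        raise KeyError("unknown vowel code: %d" % vowel_code)
    return ONSET_CODES[onset_code], VOWEL_CODES[vowel_code]

context = decode(2413)
```

Keeping the phonetic context in the number itself means the label entered into the data-base never depends on a listener's judgment of what was said.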




CHAPTER  3 

ANALYSIS  SYSTEM  AND  DATA-BASE  FACILITY 


One of the major aspects of this study is the development of a good acoustic analysis system.
It is very desirable to have an analysis procedure where acoustic changes can be monitored as
closely as possible, since acoustic characteristics of speech sounds can change significantly
within a few milliseconds. To monitor the rapid spectrum changes at the onset of the release
of a plosive, for example, would require an analysis technique more sophisticated than a
conventional filter bank or sound spectrogram. Furthermore, in order to process such a large
corpus of data in a reasonable amount of time, and to provide the user with relative ease of data
access and retrieval, the processing and storage capabilities of the analysis system must be
taken into consideration.

This chapter first gives an account of the signal-processing aspects of our present study,
with emphasis on a particular signal-processing technique called linear prediction. The
computer facilities are discussed next. We then describe the data-base facility in detail, giving
examples of its capability. Procedures through which acoustic data are entered into the
data-base are also included in this chapter.

3.1  ACOUSTIC ANALYSIS PROCEDURE

Digital computers and digital signal-processing techniques have been chosen to perform our
acoustic analysis. This choice offers many advantages, such as great flexibility, large
data-storage capability, and accuracy.

While certain acoustic events can be monitored conveniently in the time domain, experience
has shown that frequency-domain representation of the speech signal often provides greater
insights into the relationship between the articulatory and the acoustic realization of speech.
For example, spectral peaks in non-nasalized vowels can be quite reliably correlated with the
resonances of the vocal tract, and the frequency location of the major energy concentration in
a plosive release gives good indications about the location of the constriction in the vocal tract.
It is, therefore, often desirable to compute short-time spectra of the signal.

After  experimenting  with  various  methods  of  computing  and  smoothing  the  short-time 
spectra,  we  have  chosen  to  compute  the  spectra  via  a speech-analysis  procedure  known  as 
linear  prediction.  The  theory  and  limitations  of  linear  prediction  analysis  will  now  be  presented. 

3.1.1 Linear Prediction

Detailed treatment of the various formulations of linear prediction analysis can be found in
the literature (Atal and Hanauer,¹⁴ Markel,¹⁵ Makhoul and Wolf¹⁶). We shall elaborate on one
of these formulations, which is commonly referred to as the covariance formulation. We have
chosen to discuss this formulation primarily because the relevance of linear prediction analysis
to the speech signal is most apparent this way.

Linear prediction analysis is based on the speech production model shown in Fig. 3-1. The
all-pole digital filter H(z) represents the combined effect of the glottal source, the vocal tract,
and the radiation losses. In this idealized model, the filter is excited either by a periodic
impulse train for voiced speech or random noise for unvoiced speech.


Fig. 3-1. All-pole model of speech production. [Block diagram: the excitation x(n), voiced or
unvoiced, drives the all-pole filter H(z), whose output is s(n).]


The speech production model can be equivalently characterized by the difference equation

$$ s(n) = \sum_{k=1}^{p} a(k)\, s(n-k) + x(n) \tag{3-1} $$

where s(n) and x(n) are the nth samples of the output speech wave and the excitation,
respectively. The a(k)'s are the coefficients characterizing the filter H(z), and henceforth will be
referred to as the predictor coefficients.
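The recursion in Eq. (3-1) can be exercised directly. The following Python sketch is an illustration only and is not part of the analysis system described in this report; the two-pole resonator, its 1-kHz resonance, the pole radius of 0.95, and the 100-Hz impulse train are assumed values chosen to show the difference equation at work.

```python
import math

def synthesize_all_pole(a, x):
    """Evaluate s(n) = sum_{k=1..p} a(k) s(n-k) + x(n) with zero initial state.

    a[k-1] holds a(k); x is the excitation sequence.
    """
    p = len(a)
    s = []
    for n, x_n in enumerate(x):
        past = sum(a[k - 1] * s[n - k] for k in range(1, p + 1) if n - k >= 0)
        s.append(past + x_n)
    return s

def impulse_train(length, period):
    """Idealized voiced excitation: a unit impulse every `period` samples."""
    return [1.0 if n % period == 0 else 0.0 for n in range(length)]

# Assumed example values (not taken from the report): one resonance at 1 kHz,
# pole radius 0.95, 10-kHz sampling, 100-Hz fundamental frequency.
FS = 10_000.0
R, F_RES = 0.95, 1000.0
a = [2.0 * R * math.cos(2.0 * math.pi * F_RES / FS), -R * R]
x = impulse_train(400, 100)
s = synthesize_all_pole(a, x)
```

Each output sample depends only on the p previous outputs and the current excitation sample, which is why p samples of history suffice as initial conditions in the discussion that follows.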

From Eq. (3-1) it is clear that one can determine the a(k)'s if the input and 2p consecutive
values of s(n) are known, with the first p of these values serving as initial conditions. We shall
restrict the following discussion to voiced speech for which the input is a periodic impulse train.
In this case, the a(k)'s can be determined with the knowledge of only 2p consecutive values of
s(n) and the position of the impulse. For this idealized model, we can define the predicted
value of s(n) as

$$ \hat{s}(n) = \sum_{k=1}^{p} a(k)\, s(n-k) . \tag{3-2} $$

The difference between s(n) and ŝ(n) will be zero except for one sample at the beginning of each
period.

In reality, however, s(n) is not produced by this highly idealized model and, therefore,
prediction of s(n) based on Eq. (3-2) will introduce error. If we are to approximate s(n) by ŝ(n)
as defined in Eq. (3-2), the a(k)'s can be determined only with specification of the error criterion.

We can choose to determine the predictor coefficients by minimizing the sum of the squared
difference between s(n) and ŝ(n), that is, by minimizing

$$ E = \sum_{n=0}^{N-1} \left[ s(n) - \hat{s}(n) \right]^{2} \tag{3-3} $$

where the minimization is to be carried out over a section of s(n) of length N.

The minimum mean-squared error criterion is chosen over other criteria because the
determination of the a(k)'s now reduces to the solution of the following set of linear equations:

$$ \sum_{k=1}^{p} a(k)\, \varphi(i,k) = \varphi(i,0) , \qquad i = 1, 2, \ldots, p \tag{3-4} $$

where

$$ \varphi(i,j) = \sum_{n=0}^{N-1} s(n-i)\, s(n-j) , \qquad i, j = 1, 2, \ldots, p . \tag{3-5} $$

Equation (3-4) can be written in matrix form as

$$ \Phi\, \mathbf{a} = \boldsymbol{\psi} \tag{3-6} $$

where Φ is a p by p matrix with typical element φ(i,j); a and ψ are p-dimensional vectors
with the ith component given by a(i) and φ(i,0), respectively. The solution of the matrix
equation is greatly simplified by the fact that the matrix is symmetric and, hence, recursive
procedures are applicable (Faddeeva¹⁷).
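As an illustration of Eqs. (3-4) through (3-6), the sketch below forms φ(i,j) from a block of samples and solves the resulting linear system. It is a minimal Python rendering under stated assumptions, not the program used in this study: plain Gaussian elimination stands in for the recursive procedure cited above, the first p samples of the list are treated as history so that s(n-i) is always available, and the test coefficients (1.3, -0.8) are invented for the check.

```python
def solve(matrix, rhs):
    """Solve a small dense linear system by Gaussian elimination with pivoting."""
    n = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[row][c] -= factor * aug[col][c]
    out = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][c] * out[c] for c in range(row + 1, n))
        out[row] = (aug[row][n] - tail) / aug[row][row]
    return out

def covariance_lpc(s, p):
    """Covariance-method predictor coefficients a(1..p).

    The first p samples of `s` serve only as history, so the sums of
    Eq. (3-5) run over list indices p .. len(s)-1.
    """
    idx = range(p, len(s))
    phi = [[sum(s[n - i] * s[n - j] for n in idx) for j in range(p + 1)]
           for i in range(p + 1)]
    big_phi = [row[1:] for row in phi[1:]]       # phi(i, k), i, k = 1..p
    psi = [phi[i][0] for i in range(1, p + 1)]   # phi(i, 0)
    return solve(big_phi, psi)

# Check on an idealized signal: a two-pole filter driven by one impulse at n = 0.
a_true = [1.3, -0.8]   # assumed test coefficients (stable filter)
s = []
for n in range(60):
    x_n = 1.0 if n == 0 else 0.0
    s.append(x_n + sum(a_true[k - 1] * s[n - k] for k in (1, 2) if n - k >= 0))
a_est = covariance_lpc(s, p=2)
```

Because the analysis interval in this check contains no excitation sample, the prediction error can be driven to zero and the estimated coefficients coincide with the ones used to generate the signal.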

It is of interest to compare the analysis procedure outlined above for two different cases.
If the fundamental frequency of voicing (F0) is known in advance, the analysis can be carried
out directly in the sense that Eq. (3-5) can be evaluated exactly. In practice, however, it is
very desirable to carry out the analysis without a priori knowledge of F0. In this case, an
approximation has to be made and additional error is introduced. We shall illustrate this point
by a simple example. The argument can easily be generalized to include more complicated
situations.

Let us assume that there is only one pitch pulse present in the data and that it occurs at
n = m. If m is known, then Eq. (3-5) can be evaluated as

$$ \varphi(i,j) = \sum_{\substack{n=0 \\ n \neq m}}^{N-1} s(n-i)\, s(n-j) . \tag{3-7} $$

Equation (3-5) cannot be evaluated explicitly, however, if m is unknown.

If we choose to approximate φ(i,j) by

$$ \hat{\varphi}(i,j) = \sum_{n=0}^{N-1} s(n-i)\, s(n-j) , \tag{3-8} $$

then, comparing Eqs. (3-7) and (3-8), we find that the error in φ(i,j) is given by

$$ \epsilon(i,j) = \hat{\varphi}(i,j) - \varphi(i,j) = s(m-i)\, s(m-j) . \tag{3-9} $$

By the nature of the speech signal, s(m-i) and s(m-j) are small compared with samples at
the beginning of each period. Therefore, the error ε(i,j) is small compared with φ(i,j) for any
reasonable N. Results comparing the two analysis procedures will be presented in a later
section.
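The relation in Eq. (3-9) can be checked numerically. The Python sketch below is illustrative only; the two-pole test filter, the pulse position m = 40, and the frame length are assumed values, and list indices are offset so that the first p samples act as history. It compares the sum that omits n = m with the sum over all n, for i, j = 1, ..., p.

```python
def phi(s, i, j, start, stop, skip=None):
    """Sum of s(n-i) s(n-j) for n in [start, stop), optionally omitting n = skip."""
    return sum(s[n - i] * s[n - j] for n in range(start, stop) if n != skip)

# Assumed test signal: two-pole filter excited by unit pulses at n = 0 and n = m.
a_true = [1.3, -0.8]
m, length, p = 40, 120, 2
s = []
for n in range(length):
    x_n = 1.0 if n in (0, m) else 0.0
    s.append(x_n + sum(a_true[k - 1] * s[n - k] for k in (1, 2) if n - k >= 0))

identity_holds = True
worst_ratio = 0.0
for i in range(1, p + 1):
    for j in range(1, p + 1):
        exact = phi(s, i, j, p, length, skip=m)    # pulse position known, Eq. (3-7)
        approx = phi(s, i, j, p, length)           # pulse position ignored, Eq. (3-8)
        error = approx - exact                     # Eq. (3-9)
        identity_holds &= abs(error - s[m - i] * s[m - j]) < 1e-9
        worst_ratio = max(worst_ratio, abs(error) / abs(exact))
```

For this decaying test signal the samples just before the second pulse are small, so `worst_ratio` comes out well below one part in a thousand.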

The theory of linear prediction has also been formulated in a slightly different way. Let
e(n) denote the output of the inverse filter H⁻¹(z) when it is excited by s(n). If we choose to
determine the a(k)'s by minimizing the total energy in e(n), the set of equations obtained can
be shown to be almost identical to Eqs. (3-4) and (3-5) (Markel¹⁵). The major difference in the
result is that the matrix Φ in this second formulation is of Toeplitz form

$$ \varphi(i,j) = \varphi(|i-j|, 0) \tag{3-10} $$

so that the matrix equation can be solved by still more efficient algorithms (Levinson¹⁸). The
second formulation is sometimes referred to as the autocorrelation formulation.
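The Toeplitz structure of Eq. (3-10) is what makes an order-recursive solution possible. The following is a sketch of the Levinson recursion in Python under the sign convention of Eq. (3-2); it is not the implementation used in this study, and the noise-excited, Hanning-windowed test frame is an assumption made only to have something to analyze.

```python
import math
import random

def autocorrelation(s, p):
    """r(k) = sum_n s(n) s(n+k), k = 0..p, over a finite (windowed) segment."""
    return [sum(s[n] * s[n + k] for n in range(len(s) - k)) for k in range(p + 1)]

def levinson(r, p):
    """Solve sum_k a(k) r(|i-k|) = r(i), i = 1..p, by Levinson's recursion.

    Returns (a, energy): a[k-1] holds a(k); energy is the residual energy.
    """
    a = []
    energy = r[0]
    for i in range(1, p + 1):
        acc = r[i] - sum(a[j - 1] * r[i - j] for j in range(1, i))
        k_i = acc / energy
        # Order update: a_i(j) = a_{i-1}(j) - k_i * a_{i-1}(i-j), and a_i(i) = k_i.
        a = [a[j - 1] - k_i * a[i - j - 1] for j in range(1, i)] + [k_i]
        energy *= 1.0 - k_i * k_i
    return a, energy

# Assumed test frame: noise through a two-pole filter, then a Hanning window.
rng = random.Random(0)
N = 128
raw, state = [], [0.0, 0.0]
for n in range(N):
    y = rng.uniform(-1.0, 1.0) + 1.3 * state[0] - 0.8 * state[1]
    state = [y, state[0]]
    raw.append(y)
window = [0.5 - 0.5 * math.cos(2.0 * math.pi * n / (N - 1)) for n in range(N)]
frame = [w * v for w, v in zip(window, raw)]

r = autocorrelation(frame, 4)
a, energy = levinson(r, 4)
```

Each pass of the loop raises the predictor order by one and reuses the previous solution, so the order-p solution is available at every intermediate step of the recursion.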

From the predictor coefficients, the approximated spectral envelope of s(n) can be
computed as |H(e^{jω})|. Note that the unit sample response of the inverse filter H⁻¹(z) is given by

$$ h(n) = \begin{cases} 1/a(0) , & n = 0 \\ -a(n)/a(0) , & n = 1, 2, \ldots, p \\ 0 , & \text{otherwise} \end{cases} $$

Therefore, H(e^{jω}) can be obtained efficiently by computing the discrete Fourier transform of
h(n) with a fast Fourier transform algorithm, and then inverting the result.
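The procedure just described can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the program used in this study: the radix-2 FFT is a textbook recursive version, the gain term a(0) defaults to 1, and the two-pole example filter (1-kHz resonance, 10-kHz sampling) and the 512-point transform length are invented for the demonstration.

```python
import cmath
import math

def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def lpc_envelope(a, a0=1.0, nfft=512):
    """|H| at nfft//2 + 1 frequencies from 0 to half the sampling rate."""
    h = [0.0] * nfft
    h[0] = 1.0 / a0                      # unit sample response of the inverse filter
    for n, a_n in enumerate(a, start=1):
        h[n] = -a_n / a0
    inverse_filter = fft(h)
    return [1.0 / abs(v) for v in inverse_filter[: nfft // 2 + 1]]

# Assumed example: a single resonance at 1 kHz, pole radius 0.95, 10-kHz sampling.
FS, NFFT = 10_000.0, 512
a = [2.0 * 0.95 * math.cos(2.0 * math.pi * 1000.0 / FS), -0.95 * 0.95]
envelope = lpc_envelope(a, nfft=NFFT)
peak_bin = max(range(len(envelope)), key=envelope.__getitem__)
peak_hz = peak_bin * FS / NFFT
```

Only p + 1 samples of h(n) are nonzero, so the transform length can be chosen freely to set the frequency spacing of the computed envelope.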

Linear prediction analysis assumes a specific speech production model where the combined
effect of the glottal excitation, the vocal tract, and the radiation losses is represented by an
all-pole filter. Whether such a model is adequate for speech analysis leaves room for
controversy. It is well known that for all non-nasalized sonorants, the transfer function of the vocal
tract has only poles. For fricatives and nasals, however, the transfer functions have zeros as
well as poles. Added to the problem is the fact that glottal peculiarities could sometimes
introduce zeros. Therefore, from a theoretical standpoint, the all-pole model is not always
adequate. However, Atal¹⁹ has shown that these zeros are inside the unit circle, and thus can be
approximated by a number of poles via Taylor series expansion. Makhoul and Wolf¹⁶ have
shown that linear prediction analysis can be viewed as a method of analysis-by-synthesis where
the number of poles is specified and the result is a good fit to the envelope of the short-time
spectrum. Therefore, within the context of spectrum analysis, the all-pole model can be quite
adequate. It is when one attempts to associate these poles with the genuine transfer function of
the vocal tract that the all-pole assumption needs to be reexamined. Experimental findings,
with both synthetic and natural speech, have shown (Zue²⁰) that linear prediction analysis
captures the essential spectral characteristics of nasals and fricatives.
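The series-expansion argument can be made concrete for a single zero. For |b| < 1, the factor 1 - b z⁻¹ equals 1/(1 + b z⁻¹ + b² z⁻² + ...) on the unit circle, so truncating the series after a finite number of terms yields an all-pole approximation. The Python sketch below is only an illustration of that identity; the value b = 0.5 and the orders compared are assumptions, and the functions are not taken from the cited work.

```python
import cmath

def zero_response(b, omega):
    """Frequency response of the single zero 1 - b z^-1."""
    return 1.0 - b * cmath.exp(-1j * omega)

def all_pole_response(b, omega, order):
    """Response of 1 / (1 + b z^-1 + ... + b^order z^-order)."""
    z_inv = cmath.exp(-1j * omega)
    return 1.0 / sum((b * z_inv) ** k for k in range(order + 1))

def worst_relative_error(b, order, points=64):
    """Largest relative mismatch over frequencies from 0 to pi."""
    worst = 0.0
    for i in range(points):
        omega = cmath.pi * i / (points - 1)
        exact = zero_response(b, omega)
        approx = all_pole_response(b, omega, order)
        worst = max(worst, abs(approx - exact) / abs(exact))
    return worst

error_low_order = worst_relative_error(0.5, 3)
error_high_order = worst_relative_error(0.5, 10)
```

The mismatch shrinks roughly as |b| raised to the power (order + 1), which suggests why zeros close to the unit circle call for a larger number of poles.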

The number of poles, p, is determined by the sampling frequency, as well as our knowledge
of the speech production mechanism. For example, there exist, in general, 5 to 6 complex
pole pairs for the vocal tract transfer function up to 5 kHz. Adding two real poles for the
combined effect of the glottal source and the radiation losses, we arrived at a value of p
between 12 and 14 for the speech signal sampled at 10 kHz. For those sounds with zeros in the
transfer function, a higher value of p would be desirable.

As mentioned earlier, there are several formulations of the linear prediction technique
that are closely related but have important theoretical differences. The set of issues has been
explored in great detail elsewhere (Portnoff et al.,²¹ Makhoul and Wolf¹⁶). However, when
applied to a complete speech analysis-synthesis system, comparable results have been reported in
the literature (Atal and Hanauer,¹⁴ Itakura and Saito²²). We have implemented both the
autocorrelation and the covariance methods of linear prediction, and the quantitative differences
between the two methods were found to be minimal. The autocorrelation method has the
advantage of smoother spectral variation from frame to frame, a direct consequence of windowing
the speech. Also, computationally, the autocorrelation method is very efficient when
increasing the order of the predictor from p to p + 1, as is probably required for nasals and nasalized
vowels.




3.1.2 Results Comparing Various Spectrum Analysis Techniques

Figure 3-2 compares spectra of a synthetic vowel /a/ obtained by various spectral smoothing
techniques: (a) and (b) by windowing (with different window widths) and Fourier transforming
the waveform, (c) by cepstral smoothing (Oppenheim and Schafer²³), and (d) by linear
prediction. In Fig. 3-2(a), the effect of glottal periodicities can be seen as the ripples superimposed
on the spectral envelope. These ripples are greatly reduced in Fig. 3-2(b) because of the
spectral smearing of the wide frequency window. In Fig. 3-2(c), the effect of the glottal excitation
is removed by a homomorphic technique. This effect is also removed in Fig. 3-2(d). However,
since the linear prediction analysis is based on a specific speech production model and thus
limits the number of spectral peaks, there are no extraneous peaks in Fig. 3-2(d). If we
compare the locations of the spectral peaks with the actual values of the five formants, it is clear
that, for this example, the spectrum derived from linear prediction provides accurate formant
information.

Figure 3-3 shows the spectrum for the same vowel obtained by linear prediction, except that in
this case the analysis is carried out pitch-synchronously. Comparing Figs. 3-2(d) and 3-3,
we find that, except for the bandwidth of the second spectral peak, the qualitative difference
between the two spectra is quite small, as discussed in the previous section.

Figure 3-4 displays a linear prediction spectrogram and three time functions derived from
the same speech material. It can be seen that the RMS amplitude measurements derived from


Fig. 3-2. Spectra of a synthetic /a/ obtained by various analysis techniques: (a) 51.2-msec
Hanning window, (b) 12.8-msec Hanning window, (c) 30-msec homomorphic smoothing, (d) linear
prediction with 12 parameters. [Abscissa: frequency (kHz). Formants of the synthetic vowel:
F1 = 0.720 kHz, F2 = 1.250 kHz, F3 = 2.700 kHz, F4 = 3.400 kHz, F5 = 4.300 kHz.]

Fig. 3-3. Spectrum of a synthetic /a/ obtained by pitch-synchronous linear prediction analysis.
[Abscissa: frequency (kHz); same formant values as in Fig. 3-2.]

Fig. 3-4. Linear prediction spectrogram. (The utterance is "say hə'dit again.") [Time functions
shown: zero-crossing density, RMS amplitude from waveform, RMS amplitude from spectra;
abscissa: time (sec).]

Fig. 3-5. Spectra of a synthetic /i/ with different fundamental frequencies: (a) F0 = 200 Hz,
(b) F0 = 80 Hz.

Fig. 3-6. Spectra of a synthetic vowel with different fundamental frequency contours (linear F0
ramps): (a) slope 800 Hz/sec, (b) slope 400 Hz/sec. Note extra peak in (a).


the speech signal and from linear prediction smoothed spectra are practically identical. The
linear prediction spectrogram also bears a distinct resemblance to conventional spectrograms.

One property of linear prediction analysis is that the technique is relatively insensitive to
pitch variations (Atal and Hanauer¹⁴). Figures 3-5 and 3-6 illustrate this point with synthetic
speech of different fundamental frequency contours. When the fundamental frequency (F0) is
held constant and is less than, say, 200 Hz, the analysis technique is practically insensitive to
the value of F0. When F0 is time varying, the technique shows various degrees of deterioration,
depending on the rate of F0 change and the length of the analysis window, N. Atal and
Schroeder²⁴ have outlined the necessary modifications to the linear prediction analysis
procedure when F0 becomes extremely high.

Figures 3-7, 3-8, and 3-9 compare spectra obtained by linear prediction with those obtained
by discrete Fourier transform (DFT) for some vowel and fricative sounds. The spectral
matching property of the linear prediction technique is illustrated quite clearly in these figures.


Fig. 3-7. Spectra of an /a/ obtained by (a) DFT and (b) linear prediction.

Fig. 3-8. Spectra of an /i/ obtained by (a) DFT and (b) linear prediction.

Fig. 3-9. Spectra of an /s/ obtained by (a) DFT and (b) linear prediction.


Figure 3-10 contains spectrographic displays of an utterance where the spectra are obtained by
linear prediction or DFT. Properties of both techniques are self-apparent.

3.2 COMPUTER SYSTEMS

Because of the enormous size of the data-base, the processing and storage of data
constitute an integral part of the research problem. Performing an analysis as complicated as
linear prediction on thousands of utterances in a reasonable amount of time requires a fast signal
processor. Storing a fair number of utterances and providing easy access to them requires a
computer with ample storage capacity. Since these two requirements seem to be orthogonal in the
sense that no single computer available can accommodate both of these ends, two different
computer facilities were eventually used. The Univac-FDP system was used for signal processing
and the TX-2 system was used for the storage and on-line data analysis.

3.2.1 Univac-FDP System

The signal processing part of the data-base utilizes the computer facility of the Digital
Processor Group at M.I.T. Lincoln Laboratory. The facility includes a Univac 1219, which is a


Fig. 3-10. Digital spectrograms obtained by various analysis techniques and parameters.
(Serves to illustrate the differences between DFT and linear prediction, and the trade-off
between time and frequency resolutions.)


medium-size general-purpose computer, the fast digital processor (FDP), which is a very
high-speed programmable signal processor, and a number of peripheral devices.

The Univac 1219 is an 18-bit, one's-complement digital computer with 32K words of core
memory and a cycle time of 2 μsec. Peripheral devices include a paper tape reader and punch,
a drum, two tape drives for 7-track magnetic tapes, A/D and D/A converters, and two CRT
displays. The drum has a capacity of 917K words, and a maximum data rate of 103K words/sec.
One of the CRTs is a point plotting scope where time waveform and spectrum can be displayed.
The other display system generates a 256- by 256-point raster and controls the brightness of
each point by varying the intensity of the electron beam of the CRT. The refresh rate is
approximately 10 frames/sec. For speech research, this display is well suited for generating
spectrograms.

The FDP, an 18-bit, two's-complement programmable signal processor, was built by the
Group, and the architecture of the machine was designed such that it is well suited for
signal-processing applications. The FDP has two independent 4096-word data memories, a separate
480-word control memory, and 4 arithmetic elements each with its own multiplier and adder.
The machine executes two 18-bit program instructions simultaneously utilizing instruction
overlap so that the effective double instruction cycle time is 150 nsec. Effective multiplication time
is 450 nsec. Maximum data transfer between the FDP and the Univac 1219 is [figure illegible]K
words/sec. The FDP also has a set of peripheral devices, including A/D and D/A converters and a
166K-word peripheral core memory with a cycle time of a little over 3 μsec.

3.2.2  TX-2  System 

The TX-2 computer was designed and built at Lincoln Laboratory during 1956-1959. It is
a 36-bit, one's-complement, single-address machine with indexing and multilevel indexable
indirect addressing. The TX-2 has 64K words of memory, and programs are run under the
APEX time-sharing system that provides each user with up to 128K of virtual address space.
Major peripheral devices include an 800-Mb drum, magnetic tape transports, A/D and D/A
converters, and various display facilities. Each time-shared terminal has a set of displays
and a tablet that provide a highly interactive environment of man-machine communication. In
addition, hard copy of the display can be obtained from an LDX printer.

3.3 PROCEDURES FOR DATA PROCESSING AND DATA-BASE FACILITY

Recorded utterances are first digitized and processed on the Univac-FDP. The processing
also includes the relatively time-consuming task of manually marking the phonetic categories
and durations of all segments. Results of the processing are then stored on digital magnetic
tapes and entered into the data-base on the TX-2.

3.3.1 Processing on the Univac-FDP

Analog speech is band-limited to 5 kHz and sampled to 12 bits at 10 kHz. The sampling
rate was chosen because, although there exists interesting acoustic information above 5 kHz for
some speech sounds, most of the essential acoustic features are below 5 kHz. Increasing the
sampling rate proportionally increases storage space and computation time. Increasing the
sampling rate also requires a higher-order linear predictor, which is quite an undesirable
feature for many reasons.


Short-time spectra can be obtained either through DFT or linear prediction analysis. Both
methods operate on 12.8 msec of Hanning-windowed speech, and a new spectrum is computed
every 5 msec. For the linear prediction analysis, we have experimented with the different
formulations of the technique and have found minimal differences in the results. It was then
decided to use the autocorrelation method for the reasons discussed earlier.
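At the 10-kHz sampling rate, the 12.8-msec window corresponds to 128 samples and the 5-msec spectrum interval to 50 samples. The Python sketch below shows one way such framing could be arranged; it is an assumed illustration, not the Univac-FDP program, and both the symmetric raised-cosine definition of the Hanning window and the 1-kHz test tone are choices made here.

```python
import math

FS = 10_000    # sampling rate, Hz
WIN = 128      # 12.8 msec at 10 kHz
HOP = 50       # 5 msec at 10 kHz

def hanning(n_points):
    """Symmetric raised-cosine window that is zero at both ends."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / (n_points - 1))
            for n in range(n_points)]

def windowed_frames(s, win=WIN, hop=HOP):
    """Return Hanning-windowed frames, one starting every `hop` samples."""
    w = hanning(win)
    starts = range(0, len(s) - win + 1, hop)
    return [[w[i] * s[start + i] for i in range(win)] for start in starts]

# Assumed test input: one second of a 1-kHz tone.
one_second = [math.sin(2.0 * math.pi * 1000.0 * n / FS) for n in range(FS)]
frames = windowed_frames(one_second)
```

Each frame would then be passed either to a DFT or to the autocorrelation analysis; consecutive frames overlap by 78 samples, so every sample contributes to two or three spectra.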

With the help of the various display programs, the utterance is hand-marked with segmental
boundaries. This labeling procedure is illustrated in Fig. 3-11. The user makes use of the


Fig. 3-11. Labeling procedure on the FDP. [Block diagram labels: signal processing, RMS energy,
hand marking, hand labels, audio playback.]

simultaneous displays of the digital spectrogram, RMS amplitude function, time waveform, and
spectral cross sections. All displays are time-synchronized in that the short vertical bar
immediately above the spectrogram marks the position corresponding to the spectral slice and the
waveform. Selected portions of the utterance can be played back through the D/A converter to
resolve ambiguity. The level-coded waveform above the spectrogram is the result of a typical
labeling session. The utterance is: "Say hə'd t again," and the hand-label marks the schwa
before the stop, the silence and release of the prestressed /d/, the vowel, the silence, release
and aspiration of the poststressed /t/, and the following schwa. Only the schwa-CVC-schwa
portion of the utterance is actually retained on the computer.

Hand-labeling utterances is a tedious task that also involves a lot of subjective decisions.
It is well known to people who have had experience in phonetic labeling that a given
utterance can be labeled quite differently, both in phonetic content and acoustic boundaries, from
one person to another. For example, the boundary between the burst and the aspiration of a
voiceless aspirated stop is very difficult to determine, since acoustically there is a
considerable amount of overlap between these two phases of the stop release. The only way to overcome
a problem of this nature is for one to lay down, in advance, a set of rules to deal with ambiguous
situations, and to adhere to these rules as consistently as possible when labeling.

After an utterance is hand-labeled, the digitized waveform, the linear prediction spectra,
the RMS amplitude function, and the level-coded label function are all written onto digital
magnetic tape. These unformatted tapes are then read into the TX-2, formatted, and written
onto data-base tapes for later use.

3.3.2 Data-Base Facility

The speech data-base operates under the Speech Processing Controller (SPC) (Stowe²⁵).
SPC is specifically developed for the storage and manipulation of a large amount of speech data.
The SPC and the essential structure of the data-base, incidentally, were developed in
connection with the speech understanding research at Lincoln Laboratory. Although the project
terminated in 1971, the structure and programs remained on the TX-2 and have proved to be quite
useful. Each utterance in this study is identified in the TX-2 data-base by an 11-character
string and a field-type (FT) number. The characters in the string designate the recording
session, speaker initials, prestressed consonant (or consonant cluster), and poststressed
consonant. The exact format of the string is illustrated in Fig. 3-12. The FT number, ranging
from one to 15, specifies the stressed vowel in the utterance. Table 3-I lists the consonants
and vowels with the assigned identification number. For example, the string "2KNS9021805"
and FT 15 uniquely designate the utterance /ə'strawtə/ spoken by speaker KNS during the
second recording session.


Fig. 3-12. Format of the ID for an utterance in the data-base. [Fields, left to right: recording
session; speaker ID; leftmost consonant in a 3-element cluster; leftmost consonant in a
2-element cluster or middle consonant in a 3-element cluster; consonant closest to the vowel or
sonorant; poststressed consonant.]


Raw data in the data-base can be examined in two basic forms. Figure 3-13 gives an
example of the spectrographic display on the TX-2. The spectrogram, admittedly inferior in
quality to its analog counterpart in that acoustic details are not as apparent, provides an
excellent means to examine the global characteristics of the utterance. Simultaneous displays of
time, synchronized hand-labels, the RMS amplitude function, and phonetic content with the
spectrogram make this display a powerful tool with which to correlate important acoustic
features.

Figure 3-14 illustrates the second form of data display, where eight consecutive spectral
cross sections are shown with the corresponding time waveform. The vertical bars on the
waveform, spaced 5 msec apart, mark the center of each analysis window. The numbers under
the vertical bars correspond to the numbers next to the spectrum. The recording session,




TABLE 3-I

CORRESPONDENCE BETWEEN THE VOWELS AND CONSONANTS
AND THE INTERNAL REPRESENTATIONS ON THE TX-2

Consonant     Internal Numerical Representation     Vowel
p              1                                    i
t              2                                    ɪ
k              3                                    e
b              4                                    ɛ
d              5                                    æ
g              6                                    a
m              7                                    ʌ
n              8                                    ɔ
s              9                                    o
š             10                                    ʊ
f             11                                    u
θ             12                                    ɝ
z             13                                    ay
[illegible]   14                                    oy
v             15                                    aw
ð             16
l             17
r             18
w             19
y             20



Fig. 3-13. Hard copy spectrogram display from TX-2. (The utterance is "ə'koytə"; recording
session, speaker ID, and maximum RMS amplitude are shown on the top line, followed by
hand-label and RMS amplitude function; time in milliseconds.)

Fig. 3-14. Hard copy consecutive waveform and spectra display from TX-2.


speaker initials, and phonetic content are indicated automatically at the top of the figure. This
display is useful in studying rapid spectral changes at the release of stops, and measuring
various durations of speech sounds.

It should be emphasized that the most prominent feature of the data-base is not its size as
much as the accessibility of its content. By simply specifying the ID of the desired utterance
(i.e., ID string and FT number) and typing a few commands, waveform, spectra, etc., of the
utterance become available almost immediately. Partial specification of the ID will result in
an exhaustive search of the entire data-base for utterances fitting the description. For example,
if the recording session and speaker initials are not specified, the computer will return all
utterances fitting the phonetic context specified. Table 3-II gives an example of the tabulated
results of an experimental session where the vowel identity is not specified, resulting in the
tabulation of measurements for all vowels.
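The effect of partial specification can be pictured as a search in which every unspecified field matches anything. The Python sketch below is purely hypothetical: the `UtteranceKey` record, the `search` function, and all catalogue entries other than the first are invented for illustration, the consonant digits are kept as an opaque string rather than decoded, and nothing here reflects how the SPC actually stores or retrieves utterances.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass(frozen=True)
class UtteranceKey:
    session: int       # recording session
    speaker: str       # speaker initials
    consonants: str    # consonant-code digits, kept as an opaque string here
    ft: int            # field-type number (stressed-vowel code, 1..15)

def search(db: List[UtteranceKey],
           session: Optional[int] = None,
           speaker: Optional[str] = None,
           consonants: Optional[str] = None,
           ft: Optional[int] = None) -> List[UtteranceKey]:
    """Return every key matching the fields that were specified (None = any)."""
    wanted = {"session": session, "speaker": speaker,
              "consonants": consonants, "ft": ft}
    return [key for key in db
            if all(v is None or getattr(key, name) == v
                   for name, v in wanted.items())]

# Hypothetical catalogue; only the first entry mirrors the example in the text.
db = [
    UtteranceKey(2, "KNS", "9021805", 15),
    UtteranceKey(1, "KNS", "9021805", 15),
    UtteranceKey(2, "XYZ", "9021805", 3),
    UtteranceKey(2, "KNS", "0000205", 3),
]
hits = search(db, consonants="9021805")
```

Leaving out the session, speaker, and FT number returns every utterance with the given consonant context, in the spirit of the exhaustive search described above.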

Table 3-II also illustrates the types of measurements that can be obtained conveniently
through search of the data-base. Each row in the table corresponds to a new entry, with the
phonetic environment specified in the third column. Each column represents a different
measurement. For example, columns 7 and 13 correspond to durational measurements of VOT and
of the following stressed vowel, respectively. Column 14 measures the intensity of the release
for the stressed plosives.

Facilities also exist for averaging and displaying the results of an experiment. Figure 3-15
shows a plot of average vowel durations in the context of /ə'tVtə/ for the 15 vowels and
diphthongs. The horizontal axis represents the 15 vowels and diphthongs, and the vertical axis
measures vowel durations in milliseconds. The results are in agreement with those reported
in the literature (e.g., House,²⁶ Peterson and Lehiste²⁷).

Figure 3-16 represents a scatter diagram of VOT vs burst frequency for all singleton stops
of speaker KNS. The terminology and results will be discussed in later chapters. The purpose
of introducing this figure now is to demonstrate that a large amount of data can be analyzed and
displayed conveniently. Close examination of this figure reveals that over 95 percent of the
voiceless stops have a VOT greater than 40 msec, and that coronal stops in general have a
burst frequency greater than 3000 Hz. The time required to generate such a plot is of the order
of 2 to 5 minutes, depending on the size of the data set.


Fig. 3-15. Average vowel duration in the syllable tVt; speaker KNS.

Fig. 3-16. Scatter diagram of VOT vs burst frequency obtained on TX-2. [Plot title: VOT vs
burst frequency, all stops; number of samples = 270. Axis label: burst frequency (Hz).]


CHAPTER  4 

TEMPORAL CHARACTERISTICS OF ENGLISH STOPS


The effort described in Chaps. 2 and 3 is directed toward the development of a general
facility where controlled studies of the acoustic characteristics of selected consonants,
consonant clusters, and vowels in a prescribed phonetic environment can be carried out. The type
of acoustic data we have collected, and the process through which they were collected, have
been described in detail in Chap. 2. The structure of the data-base and the facility were
illustrated in Chap. 3.

The next two chapters deal with the results of a study of the English stops, /p,t,k,b,d,g/,
using the collected data and the data-base facility. Various aspects of the temporal
characteristics of stops in prestressed position will be investigated in this chapter, with
results on the spectral characteristics presented in Chap. 5.

While  certain  aspects  of  our  results  might  have  been  reported  in  the  past  by  others,  it  is 
felt  that  the  careful  control  of  the  phonetic  context,  the  improved  measuring  techniques,  and 
the  analysis  of  a large  corpus  of  data  of  the  present  study  will  provide  us  with  statistically 
more  reliable  results,  which  in  turn  will  make  the  interpretation  of  findings  easier. 

In this chapter, we first define the terms that we will use in describing the temporal
characteristics of stops. Results on the durations of these stops, both as singletons and in
clusters, as a function of their underlying features and the phonetic environment, will
then be presented.

4.1  MEASUREMENTS  AND  TECHNIQUES 

The production of prestressed plosives in the /ə'CVCə/ context is marked acoustically by
several  stages.  The  closure  phase  occurs  after  the  initial  schwa  (or  the  initial  /s/  in  certain 
clusters)  and  is  produced  articulatorily  by  forming  a complete  constriction  somewhere  in  the 
vocal  tract.  Acoustically,  this  phase  is  characterized  by  the  absence  of  sound  energy  in  the 
resulting  speech  signal  for  unvoiced  stops.  For  the  prevoiced  stops,  however,  there  exists  a 
small  amount  of  acoustic  energy  at  very  low  frequencies  during  the  closure  phase,  due  to  the 
spontaneous  vibration  of  the  vocal  cords  and  the  subsequent  radiation  of  sound  energy  through 
the  walls  of  the  vocal  tract. 

During the closure, air pressure continues to build up behind the constriction and is
released finally by a sudden opening of the constriction. The release is usually accompanied by
a burst  of  frication  noise  as  air  rushes  through  the  narrow  constriction,  creating  turbulence 
noise.  The  opening  at  the  constriction  continues  to  increase,  until  finally  it  becomes  large 
enough  such  that  turbulence  noise  can  no  longer  be  generated.  Acoustically,  the  release  is 
marked  usually  by  a sudden  increase  of  sound  energy  at  all  frequencies,  and  the  burst  at  the 
release  usually  has  energy  with  frequency  characteristics  depending  upon,  among  other  things, 
the  position  of  the  constriction  in  the  vocal  tract. 

For the voiced (unaspirated) stops in English (i.e., /b,d,g/), voicing may continue
throughout the release, and after the burst of frication noise, normal voicing of the following
vowel (or sonorant) commences. The release of the aspirated English stops (i.e., /p,t,k/), however,
is quite different. Since there is no spontaneous voicing before the release, the vocal cords
are not adjusted to a configuration or state appropriate for vibration. As the constriction opens,
the vocal cords are brought closer to each other, and the tension of the cords is also adjusted
so that voicing for the following vowel (or sonorant) can start. As the size of the glottis
decreases, noise is generated at the glottis. This noise source in turn excites the vocal tract
and creates an aspirated sound.

The aspiration has several acoustic characteristics that distinguish it from the frication
noise of the burst release. The noise source during the burst is directly in front of the
constriction and, for a small enough constriction, excites only those resonances that are
associated with the front cavity. The noise source for aspiration is situated close to the glottis
and, therefore, all except the lowest resonance of the vocal tract are being excited. Movements of
the articulators and the resulting formant motions can, for the most part, be observed on the
spectrogram during aspiration. Another characteristic of aspiration is that there is a certain
degree of coupling between the supra- and subglottal cavities, depending on the opening at the
glottis. Subglottal formants (Fant et al.28) frequently appear during aspiration.


Fig. 4-1. Illustration of the various durational measurements made in this study.
[Labeled intervals: closure, release.]


Figure 4-1 illustrates the various durational measurements that are made on the stop
consonants. The closure interval is measured from the end of the preceding schwa (or /s/ in
certain clusters) to the release which, in general, is marked by a sharp increase in acoustic
energy. Both the beginning and the end of the closure interval can usually be identified quite
easily, except in the case of /p,b/ whose releases are often weak, partly due to the fact that
the constriction is formed at the lips with no excitable cavity in front of the noise source. The
release interval is measured from the onset of burst release to where the time waveform first
shows signs of periodicity following the release. For the voiceless aspirated stops, the release
interval is further divided into frication and aspiration intervals.

As noted by other investigators (Lisker and Abramson,6 Klatt8), voiced stops in such
intervocalic positions often maintain voicing during the closure interval. Voicing may continue
through the plosive release, or the vocal cords may cease to vibrate when the supraglottal pressure
buildup becomes too great. Since prevoicing is not a phonemic determinant in English, all
prevoicing is ignored in our study. For the remainder of this chapter, the term VOT will be used
to designate the duration of the release interval for voiced and voiceless stops alike. For the
voiced stops, VOT is really a measure of the burst duration, and should be interpreted with
caution.
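The interval definitions above reduce to differences between a few time marks. The sketch below is an illustration under stated assumptions, not the procedure used on the facility: the function name `stop_durations`, the argument names, and the example times are hypothetical, and all times are taken to be in milliseconds from an arbitrary origin.

```python
def stop_durations(closure_onset, release, voicing_onset, aspiration_onset=None):
    """Durations (msec) of the intervals of one stop, computed from time marks.

    closure_onset    end of the preceding schwa (or /s/)
    release          onset of the burst release
    voicing_onset    first sign of periodicity in the waveform after the release
    aspiration_onset boundary between frication and aspiration (aspirated stops only)
    """
    if not closure_onset <= release <= voicing_onset:
        raise ValueError("time marks must be in chronological order")
    durations = {
        "closure": release - closure_onset,
        "vot": voicing_onset - release,  # duration of the release interval
    }
    if aspiration_onset is not None:
        if not release <= aspiration_onset <= voicing_onset:
            raise ValueError("aspiration onset must lie within the release interval")
        durations["frication"] = aspiration_onset - release
        durations["aspiration"] = voicing_onset - aspiration_onset
    return durations


# Hypothetical marks for an aspirated stop and for a voiced stop.
print(stop_durations(100.0, 190.0, 250.0, aspiration_onset=210.0))
print(stop_durations(0.0, 80.0, 95.0))
```

Leaving `aspiration_onset` unset for the voiced stops mirrors the text: their release interval is not subdivided.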


As mentioned earlier, although the articulatory gesture and the positions of the noise
sources are quite different for burst and aspiration, the acoustic manifestation of these two
phases of the stop release is not always so unambiguous. In fact, there is invariably a certain
degree of overlap in the acoustic cues, making the effort to distinguish them extremely difficult.
The only way to deal with such a problem is to lay down a set of specific rules and then adhere
to them as consistently as possible. In marking the boundary between frication and aspiration,
we primarily use the knowledge that the burst usually has acoustic energy concentrated within
a certain region, that subglottal formants frequently appear in aspiration, and that second and
higher formants can usually be traced from the following vowel (or sonorant) back into the
aspiration of the prestressed stop.

4.2 SUMMARY OF DATA

Table 4-I summarizes the nature of the data used in this study. It contains 1728 utterances
spoken by 3 male speakers. Fifteen vowels were used to form the syllable nuclei as shown in
Table 4-I. Besides the singleton stops /p,t,k,b,d,g/, certain word-initial clusters containing
stops are also included. For the singleton stops, each C-V combination contains 9 separate
tokens from 3 speakers (3 tokens from speaker KNS, 3 from UHK, and 3 from ISK). For stops
in clusters, each C-V combination contains 3 separate tokens from only 2 speakers (2 from
KNS and 1 from UHK). The poststressed consonant is either /t/ or /d/.


4.3 RESULTS

Since variability among speakers and recording sessions has not been appreciable, the
results presented in the remainder of this chapter have been pooled within and across speakers,
unless otherwise specified.


4.3.1 Singleton Stops

Results on the VOT for the voiced stops /b,d,g/ are shown in Fig. 4-2. Averaging across
place of articulation, the mean VOT was found to be 20.6 msec. This value is smaller than
that reported for monosyllabic words spoken in isolation (Lisker and Abramson5) and somewhat
larger than that reported for words excised from spoken sentences (Lisker and Abramson,
Klatt8). Standard deviations of the measurements were 1 msec for /b,d/ and 6 msec for /g/.
These variabilities are again consistent with those reported in the literature.


Fig. 4-2. Average VOT for the singleton voiced stops.

Fig. 4-3. VOT for the singleton voiced stops as a function of vowel context.


Results in Fig. 4-2 also show that the VOT varies as a function of the place of articulation
of the stop. Mean VOT is 13 msec for /b/, 19 msec for /d/, and 30 msec for /g/. Release
time is consistently smaller than the mean for labials and larger than the mean for velars. The
dependency of VOT on the place of articulation of the stop has been observed by other
investigators in English (Klatt8), as well as across several languages (Lisker and Abramson7).

The VOT for the voiced stops is shown in Fig. 4-3 as a function of the vowel environment.
As in the case of the combined results shown in the previous figure, the VOT for velars is
consistently longer than for dentals, which in turn is longer than for labials. No consistent
dependency of VOT on the following vowel was found across all three stops, although
some individual variations appear to be sensitive to the vowel context. For example, VOTs for
/g/'s preceding back vowels are among the longest, whereas VOTs for /b/'s preceding low
vowels are the shortest.

Figure 4-4 summarizes results on the VOT for the voiceless aspirated stops /p,t,k/.
Averaging across place of articulation, the mean VOT was found to be 67.5 msec. As in the
case of the VOT for voiced stops, the VOT is longest for velars (7 msec longer than the mean
value) and shortest for labials (9 msec shorter than the mean value). Standard deviations were
14 msec for /p/, 7 msec for /t/, and It msec for /k/. The larger standard deviation for /p/
can presumably be attributed to measurement error, since the release of /p/, with virtually
no excitable cavity in front of the noise source, is generally weak and difficult to locate.


Fig.  4-4.  Average  VOT  for  the  singleton 
voiceless  stops. 




Also plotted in Fig. 4-4 are the durations of the frication and aspiration intervals. The burst
duration for the voiceless aspirated stops is, on the average, 15 msec longer than that of the
voiced stops. Aspiration durations for the three stops are approximately equal, thus leaving the
burst durations for the three voiceless stops with the same relationship as described previously.

In measuring the durations of the closure interval for the voiceless aspirated stops, it was
found that the inverse relationship holds. The closure interval is longest for /p/ and shortest
for /k/. In fact, as shown in Fig. 4-5, where closure as well as voicing onset durations are
included, the total duration of these stops, measured from the beginning of silence to the first
onset of voicing, was found to be practically identical. The total durations for /p/ and /t/ are
150 msec, whereas the total duration for /k/ was found to be 148 msec.
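The inverse relationship between closure and release durations can be checked by simple subtraction. The sketch below assumes that the total duration is the sum of the closure interval and the VOT, and that the per-stop mean VOTs in the "All vowels" row of Table 4-II are the ones entering the totals quoted above; the resulting closure values are therefore implied figures, not measurements reported in the text.

```python
# Total durations (msec) quoted in the text for the singleton voiceless stops.
TOTAL_MSEC = {"p": 150.0, "t": 150.0, "k": 148.0}

# Mean VOT (msec) per stop, taken from the "All vowels" row of Table 4-II.
MEAN_VOT_MSEC = {"p": 58.4, "t": 70.8, "k": 73.2}


def implied_closure(total_msec, vot_msec):
    """Closure duration implied by total = closure + VOT, rounded to 0.1 msec."""
    return {stop: round(total_msec[stop] - vot_msec[stop], 1) for stop in total_msec}


closures = implied_closure(TOTAL_MSEC, MEAN_VOT_MSEC)
for stop in ("p", "t", "k"):
    print(f"/{stop}/: implied closure {closures[stop]} msec")
```

Under these assumptions the implied closure shrinks from /p/ to /t/ to /k/ as the VOT grows, which is the trade-off described above.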

VOT for /p,t,k/ as a function of the vowel context is shown in Fig. 4-6. Contrary to the
findings by Klatt, there appears to be no systematic variation of VOT as a function of whether
or not the following vowel is high. VOT for /t/ shows no appreciable variation from one vowel
to another. VOT for /k/ is longer preceding /i,u/, both having the feature [+high]. However,
VOT for /k/ is also long preceding /ay, ay/.

Table 4-II summarizes results on the dependency of VOT on the underlying features of the
following vowel. For example, the first row in Table 4-II lists averaged VOTs for /p,t,k/
preceding high vowels, and the second row lists averaged VOTs for /p,t,k/ preceding vowels
which are not high. The directions of the feature-dependency from one stop to another have
not been consistent for all the features tested. Neither did the averaged VOTs in the last
column show any significant difference for all the features tested.


TABLE 4-II

AVERAGE VOT FOR SINGLETON VOICELESS
STOPS AS A FUNCTION OF VOWEL FEATURES
(Each Entry Represents the Average of All
Tokens Preceding Vowels with the Given Feature)

Feature of the
following vowel       p        t        k       All

+High                ...      70.0     76.3     66.1
-High                60.9     71.5     72.5     68.3
+Rounded             61.2     70.5     ...      68.9
-Rounded             56.8     71.4     72.4     66.8
+ATR*                56.0     71.7     75.0     67.6
-ATR                 58.5     71.4     ...      66.9
+Back                61.0     70.9     74.3     68.7
-Back                53.6     71.4     71.8     65.6
All vowels           58.4     70.8     73.2     67.5

* ATR = advanced tongue root.
... Entry not legible in the source copy.
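If each stop contributes the same number of tokens to a row, the "All" entry should be close to the unweighted mean of the /p/, /t/, and /k/ entries. The sketch below checks this for the rows of Table 4-II that list all four values; the equal-weight assumption and the names `COMPLETE_ROWS` and `max_deviation` are illustrative and are not taken from the report.

```python
# Rows of Table 4-II that list all four entries: (p, t, k, All), in msec.
COMPLETE_ROWS = {
    "-High": (60.9, 71.5, 72.5, 68.3),
    "-Rounded": (56.8, 71.4, 72.4, 66.8),
    "+ATR": (56.0, 71.7, 75.0, 67.6),
    "+Back": (61.0, 70.9, 74.3, 68.7),
    "-Back": (53.6, 71.4, 71.8, 65.6),
    "All vowels": (58.4, 70.8, 73.2, 67.5),
}


def max_deviation(rows):
    """Largest gap (msec) between the 'All' entry and the mean of p, t, k."""
    return max(abs(sum(row[:3]) / 3.0 - row[3]) for row in rows.values())


for name, row in COMPLETE_ROWS.items():
    print(f"{name:<11} mean of p,t,k = {sum(row[:3]) / 3.0:.1f}   All = {row[3]:.1f}")
print(f"largest deviation: {max_deviation(COMPLETE_ROWS):.2f} msec")
```

Every complete row agrees to within 0.1 msec, which is the rounding precision of the table.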

Fig. 4-7. Average VOT for the voiced stops in stop-sonorant clusters.

Fig. 4-8. Average VOT for the voiceless stops in stop-sonorant clusters.

Fig. 4-9. Average VOT for the stops in clusters as a function of the following sonorant.

Fig. 4-10. Average VOT for the voiceless stops in /s/ clusters.


4.3.2  Stops  In  Clusters 

A total of 21 word-initial consonant clusters containing stops has been investigated in this
study as shown in Table 4-I. Fourteen of the 2-element clusters involve stop-sonorant
combinations, with the remaining clusters involving /s/-stop or /s/-stop-sonorant combinations.
It should be reiterated that in the case of stops preceding a sonorant, the durational
measurements were made such that the end of the stop is marked by the first signs of periodicity
in the time waveform following the release.

Results on the VOT for the voiced stops /b,d,g/ in clusters are shown in Fig. 4-7. For the
sake of comparison, values for the singleton stops were also plotted alongside. Averaging
across place of articulation, the mean VOT was found to be 26.3 msec, a 5.7-msec (or
28 percent) increase from stops in isolation. The mean VOT, as a function of the place of
articulation of the stops, varies in the same fashion as stops in isolation. Figure 4-7 also
shows that the increase in VOT from singleton to clusters is the greatest, 7 msec or 37 percent,
for dentals.

Figure 4-8 summarizes results on the VOT for the voiceless stops in clusters. Averaging
across place of articulation, the mean VOT was found to be 85.6 msec, an 18.1-msec (or
27 percent) increase over stops in isolation. As in the case of voiced stops in clusters, the
increase in VOT is the greatest, 28.4 msec or 40 percent, for dentals. In fact, the results in
Fig. 4-8 indicate that dentals in clusters have a mean VOT greater than that of velars.
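The percentage increases quoted for stops in clusters follow from dividing each increase by the corresponding singleton mean. The sketch below reproduces that arithmetic from figures stated in this chapter (the 20.6-msec and 67.5-msec singleton means, and the 70.8-msec singleton /t/ mean from Table 4-II); the function name `percent_increase` is hypothetical, and the choice of the pooled singleton mean as the base is an assumption.

```python
def percent_increase(singleton_mean_msec, increase_msec):
    """Increase expressed as a percentage of the singleton mean VOT."""
    return 100.0 * increase_msec / singleton_mean_msec


# (singleton mean, increase in clusters), both in msec, as quoted in the text.
cases = {
    "voiced stops": (20.6, 5.7),
    "voiceless stops": (67.5, 18.1),
    "voiceless dentals": (70.8, 28.4),
}

for name, (base, increase) in cases.items():
    print(f"{name}: {percent_increase(base, increase):.1f} percent")
```

Rounded to whole numbers, the three cases give 28, 27, and 40 percent.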

The results in Figs. 4-7 and -8 can be broken down further with respect to the following
sonorant, as is shown in Fig. 4-9. The mean VOT for stops in clusters, averaged over all
vowel contexts, varies from 15 msec for /bl/ to 100 msec for /tw/. The increase in VOT over
singleton stops ranges from 1.5 msec for /gl/ to 29.3 msec for /tw/.

Results of VOT for stops in /s/ clusters are shown in Fig. 4-10. The VOT decreases
sharply when a voiceless stop appears in /s/ clusters. Averaging across place of articulation,
the mean VOT for stops in /s/ clusters preceding a vowel was found to be 22.7 msec. This
value is just slightly higher than the mean VOT found for voiced stops in isolation. When the
stops appear in /s/ clusters preceding a sonorant, the mean VOT was found to be 30 msec.

4.4 DISCUSSION

The results reported in the previous section show that even in a controlled environment,
i.e., the prestressed position, the VOT data span a wide range of variations. The averaged
VOT for voiced stops ranges from 13 msec for /b/ to 38 msec for /gw/. The averaged VOT
for voiceless unaspirated stops ranges from 15 msec for /sp/ to 38.7 msec for /str/, and the
averaged VOT for voiceless aspirated stops ranges from 58.4 msec for /p/ to 100 msec for
/tw/. Even with identical phonetic environment, a certain degree of overlap in VOT can be
found between voiced and voiceless stops. For example, 22, or 18 percent, of singleton /p/'s
have VOT less than or equal to 40 msec, whereas 18, or 13 percent, of singleton /g/'s have VOT
greater than or equal to 40 msec. Our data reaffirm the finding by Klatt that the perceptual
decision on the voicing feature for English stops cannot be made on the basis of VOT alone.
Some derivative of the VOT, such as the presence or absence of rapid spectral changes at the
onset of voicing suggested by Stevens and Klatt,29 or some a priori knowledge of the phonetic
environment and the place of articulation of the stops may play a role in the distinction between
voiced and voiceless stops.




The increase of VOT from /b/ to /d/ to /g/ can be explained by the position and shape of the
oral constriction, as well as the articulators involved in forming the constriction (Klatt). The
constriction for /b/ is formed at the lips, which can move away quite rapidly following the
release. As mentioned earlier, the labial release is usually weak in intensity, which also
contributes to the appearance of a short burst.

The constriction for /g/ is formed by the tongue body, which is rather massive and cannot
move away from the palate too rapidly following the release. It has also been observed (Houde,30
Perkell31) that the motion of the tongue body after the release is such that a tapered,
narrow opening is maintained for a longer period of time. Therefore, the constriction for /g/
opens slowly, allowing turbulence noise to be generated for a longer period of time.

The burst durations for /p,t,k/ have the same relationship as those of the voiced stops. This
inherent  difference  can  presumably  be  explained  in  the  same  way  as  outlined  above. 

That  the  voiceless  aspirated  stops  all  have  identical  total  duration  is  somewhat  surprising. 
This  result  cannot  be  attributed  simply  to  the  rhythmic  pattern  and  constant  interstress  interval 
of  the  utterance.  A possible  explanation  can  be  proposed  based  on  the  laryngeal  behavior  during 
the production of these stops and its timing relative to the release of the supraglottal
constriction (Stevens32).


Fig. 4-11. Schematized relationship between glottal spreading and supraglottal opening.
[Curves: supraglottal opening, glottal spreading. Time marks: closure onset, voicing onset.]

The time course of events in the production of voiceless aspirated stops is schematized in
Fig. 4-11. The supraglottal movement from the preceding schwa to the following vowel is shown
as a simple closing and opening at the place of articulation of the stop. The vocal cords,
however, will be stiffened and the glottis must be spread apart to prevent voicing and allow the
generation of aspiration after oral release (Halle and Stevens33). If we propose that the timing
of the stop release is controlled by two more or less separate mechanisms, one for the
supraglottal release and one for glottal abduction, then the phenomena observed above can be
explained in the following way. The constant total duration of the stops could be a direct
consequence of the fact that the amount of time required to spread and close the glottis is
independent of the timing of the supraglottal movement. Given this constant time interval, the
release will then have to be adjusted, according to the articulators involved, such that by the
time of voicing onset, essentially all the transition is completed. This lack of rapid spectral
change at the onset of voicing is necessary for the proper perception of the voiceless stops, as
suggested by Stevens and Klatt.29

The increase in VOT for stops in stop-sonorant clusters has been reported in the past in
phonetic literature. Our data indicate that the following voiced segment is lengthened on the
average by It msec for voiceless stops and by 11 msec for voiced stops. These data support
the claim by Klatt8 that only the initial part of the sonorant is devoiced. In fact, as suggested
by Klatt, the greater increase for voiceless stops could be a phonological rule where the
sonorant is lengthened following voiceless stops so that a substantial amount of the transition
still remains after voicing onset.

The greater increase of VOT for dental-sonorant clusters is probably a direct consequence
of the coarticulatory effect. /w/ in stop-sonorant clusters forms a secondary constriction due
to lip rounding. A brief interval of silence often can be observed between the aspiration and
voicing onset, due to a lack of acoustic output. Such a silent interval may account for the
increase in VOT.

When dentals appear in dental-/r/ clusters, the larger increase in VOT may again be a
consequence of the articulatory constraints. Since the stop and the sonorant both utilize the
tongue tip and /r/ tends to curl the tongue blade toward the back, the release is such that the
constriction is opened slowly. Klatt also has observed a longer burst duration for the
dental-/r/ clusters.

Many aspects of our results are in good agreement with the data reported recently by Klatt
on a substantially smaller data-base. We are, however, unable to find any dependency of VOT
on the vowel context to support his claim that VOT is longer before high vowels. This
discrepancy probably can be attributed to a difference in measurement technique. Most of the past
studies (e.g., Lisker and Abramson,7 Klatt8) utilize spectrograms, and VOT is measured from
the release to the onset of voicing, defined as the time where second and higher formants are
visibly excited by voicing striations. In the present study, VOT is measured from the release
to the time where the waveform first shows signs of periodicity. At the onset of voicing, the
initial glottal pulses may be produced with the glottis still relatively spread. This "edge
vibration" of the vocal cords will lend itself to a vibration pattern that is initially weak in
high-frequency energy. Therefore, depending on the amount of second and higher formant cutback,
the two measuring techniques can result in different VOT values. Measuring VOT from visible
excitation of higher formants will perhaps result in a more perceptually based interpretation
of data, whereas measuring VOT directly from the waveform will result in data which can
possibly shed more light on the time course of articulatory events in the production of these
stops.




CHAPTER 5

SPECTRAL CHARACTERISTICS OF ENGLISH STOPS


This chapter reports our findings on the spectral characteristics of English stops, in
prestressed position, using the collected data and the developed facility as described earlier.
The data set is the same as the one used in the previous chapter.

Whereas the temporal characteristics of stops, such as closure, burst, and voicing onset
durations, have been studied extensively both through perceptual experiments using synthetic
speech and analysis of real speech data, the spectral characteristics of stops have been studied
primarily in the context of experiments looking for perceptually important acoustic cues. These
perceptual experiments (for example, Cooper et al., Stevens and Klatt29) use almost exclusively
synthetic speech material. While it is true that the synthesis features are determined through
our acoustic phonetic knowledge, synthetic speech might sometimes be an overly simplified, or
even erroneous, representation of the real data. The problem lies both in the limited scope of
the data examined and the methods by which they were analyzed. Most of the past observations
and measurements were made on spectrograms, which have a very limited dynamic range to
represent spectral intensity. Locating the exact frequency of a stop burst from the spectrogram
is extremely difficult, and deducing the spectral shape at the release is next to impossible. In
addition, the AGC circuitry in a spectrograph machine has a tendency to increase the burst
intensity following the silence interval, thereby making the burst appear to be more intense than
it really is. Quantitative data collected from real speech, such as those reported here,
are undoubtedly necessary to substantiate and corroborate the findings of the perceptual
experiments.

This  chapter  concentrates  on  two  aspects  of  the  stop  release,  namely,  the  peak  intensity 
of the stop burst and the spectral characteristics of the stop 10 or 15 msec following the release.
We  shall  begin  by  first  defining  the  measurements  that  were  made,  and  then  we  shall  present 
the  results  and  discuss  their  implications. 

5.1 MEASUREMENTS AND TECHNIQUES

For the remainder of this chapter, we shall use the term burst frequency to designate the
frequency of maximum spectral amplitude in the spectrum computed from the first 10 to 15 msec
of the waveform following the release. For display and measurement purposes, the computed
spectrum is postemphasized with a frequency response similar to that of a conventional
spectrograph.

Burst spectra computed from linear prediction often show sharp spectral peaks. The
relative sharpness of these spectral peaks is a direct consequence of using linear prediction as a
spectral smoothing technique (Makhoul and Wolf). In any case, we are interested in the
frequency locations of major energy concentrations, not in the effect of the local spectral maxima.
Therefore, the burst spectra computed from linear prediction are further smoothed, using a
3-sample digital filter, to minimize the effect of the high-Q poles. This final smoothing enables
us to locate the burst frequency reliably through a semiautomatic process where the program
automatically locates and displays the burst frequency and the user can interactively modify
the result before filing. Our experience has been that the occasions where user intervention is
needed have been rare, on the order of one percent of our data set.
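The automatic part of this procedure, smoothing followed by peak picking, can be sketched as follows. This is an illustration only: the report does not give the coefficients of the 3-sample filter, so the weights (1/4, 1/2, 1/4), the replication of the end samples, the function names, and the seven-point spectrum are all assumptions.

```python
def smooth3(values, weights=(0.25, 0.5, 0.25)):
    """Smooth a sequence with a 3-sample filter; end samples are replicated."""
    padded = [values[0]] + list(values) + [values[-1]]
    w0, w1, w2 = weights
    return [
        w0 * padded[i] + w1 * padded[i + 1] + w2 * padded[i + 2]
        for i in range(len(values))
    ]


def burst_frequency(freqs_hz, spectrum_db):
    """Frequency and amplitude of the maximum of the smoothed spectrum."""
    smoothed = smooth3(spectrum_db)
    peak = max(range(len(smoothed)), key=smoothed.__getitem__)
    return freqs_hz[peak], smoothed[peak]


# Made-up spectrum: an isolated sharp peak at 1000 Hz and a broader
# concentration of energy around 2500 Hz.
freqs = [500, 1000, 1500, 2000, 2500, 3000, 3500]
spectrum = [10, 30, 12, 20, 24, 22, 8]

raw_peak_hz = freqs[spectrum.index(max(spectrum))]
burst_hz, burst_db = burst_frequency(freqs, spectrum)
print(f"peak of original spectrum: {raw_peak_hz} Hz")
print(f"burst frequency after smoothing: {burst_hz} Hz ({burst_db} dB)")
```

In this example the smoothing moves the maximum from the isolated peak to the broader energy concentration, which is the intended effect of suppressing high-Q poles. A user override, as described above, would simply replace the returned frequency before filing.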


Fig. 5-1. Spectra of a /k/ burst (a) before and (b) after further smoothing. (Pointer
indicates burst frequency.)

Fig. 5-2. Spectra of a dental-stop burst (a) before and (b) after further smoothing. (Pointer
indicates burst frequency.)

Fig. 5-3. Composite display of the spectra of a /d/ burst and the following vowel.
(Both spectra have been smoothed. The difference in amplitude is about -2 dB.)


Figures 5-1 and -2 give examples of the procedure outlined above. In Fig. 5-1, the burst
frequency corresponds exactly to the peak in the original spectrum. This correspondence is
invariably the case for velars. For dentals (as shown in Fig. 5-2), the burst frequency, as
defined here, often does not coincide with spectral peaks in the original spectrum.

Once the burst frequency is located, the amplitude of the burst is also determined, as
shown in Figs. 5-1 and -2. For the sake of comparison, measurement is also made on the
amplitude of the spectral peak of the following vowel in the same frequency region. Spectral
amplitude for the vowel is measured on the smoothed spectrum computed at the midpoint of the
vowel. For the remainder of this chapter, the term burst amplitude is used in a relative sense,
defined as the difference between the burst amplitude and the vowel amplitude. A positive value
would indicate that the burst is higher in amplitude than the following vowel.

Figure 5-3 shows the burst spectrum of a /d/ in the syllable /da’/. The spectrum at the
midpoint of the vowel is superimposed. For this example, the burst amplitude is on the order
of -2 dB.

The overall RMS amplitude of the burst is also measured in this study. The RMS amplitude
is computed on the first 10 to 15 msec of the stop release, and is normalized by the maximum
RMS amplitude for each utterance. Although the maximum RMS value is always within the
stressed vowel, the location of the maximum rarely occurs at the midpoint of the vowel.
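A minimal sketch of this normalized RMS measurement is given below. The report does not say how the maximum RMS amplitude of an utterance is found, so the use of consecutive non-overlapping windows of the same length as the burst window is an assumption, as are the function names, the sampling rate, and the toy waveform.

```python
import math


def rms(samples):
    """Root-mean-square amplitude of a non-empty sequence of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def normalized_burst_rms(waveform, release_index, sample_rate_hz, window_msec=10.0):
    """RMS of the window after the release, divided by the utterance maximum."""
    n = max(1, int(round(sample_rate_hz * window_msec / 1000.0)))
    burst = waveform[release_index:release_index + n]
    if not burst or len(waveform) < n:
        raise ValueError("waveform too short for the requested window")
    # Assumption: the utterance maximum is taken over consecutive windows of length n.
    frame_rms = [rms(waveform[i:i + n]) for i in range(0, len(waveform) - n + 1, n)]
    peak = max(frame_rms)
    return rms(burst) / peak if peak > 0 else 0.0


# Toy waveform sampled at 1000 Hz with a 2-msec window: silence, a weak burst,
# a stronger vowel, and silence again.
waveform = [0, 0, 3, -3, 6, -6, 6, -6, 0, 0]
ratio = normalized_burst_rms(waveform, release_index=2, sample_rate_hz=1000.0, window_msec=2.0)
print(f"normalized burst RMS: {ratio}")
```

For the toy waveform the burst RMS is half the maximum, so the printed ratio is 0.5. Expressing the ratio in decibels would be a matter of taking 20 log10 of it.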

5.2  RESULTS

Results  presented  in  this  chapter  are  pooled  across  speakers  and  recording  sessions. 
Characteristics  attributable  to  interspeaker  variations  will  be  discussed  whenever  appropriate. 

It  should  also  be  mentioned  that  the  results  on  stops  in  clusters  are  obtained  from  a smaller 
sample  than  those  on  singleton  stops. 

5.2.1  Singleton  Stops 

Before presenting any quantitative data, it is appropriate to examine first some qualitative
characteristics of the stop releases. Figure 5-4 shows the general spectral characteristics
at the release of /t/ and /k/ for two different speakers. The pictures in Fig. 5-4 are
multiple-exposure shots obtained from some forty stops. Since the burst intensity varies from
one utterance to another, all spectra in Fig. 5-4 have been normalized by their respective
geometric means. This normalization procedure, which tends to reduce the amplitude differences
among spectra and produce a cleaner clustering, is used strictly for display purposes.
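The effect of geometric-mean normalization can be illustrated with a short sketch. It assumes linear (not dB) spectral amplitudes that are all positive; the function name and the two toy spectra are hypothetical.

```python
import math


def normalize_by_geometric_mean(spectrum):
    """Divide each linear spectral amplitude by the geometric mean of the spectrum."""
    log_mean = sum(math.log(a) for a in spectrum) / len(spectrum)
    geo_mean = math.exp(log_mean)
    return [a / geo_mean for a in spectrum]


# Two toy spectra with the same shape; the second is 20 dB (a factor of 10) stronger.
weak = [1.0, 4.0, 2.0]
strong = [10.0, 40.0, 20.0]

weak_norm = normalize_by_geometric_mean(weak)
strong_norm = normalize_by_geometric_mean(strong)
print([round(v, 3) for v in weak_norm])    # [0.5, 2.0, 1.0]
print([round(v, 3) for v in strong_norm])  # [0.5, 2.0, 1.0]
```

Because an overall gain multiplies the geometric mean by the same factor, spectra that differ only in level coincide after normalization, which is why overlaid displays cluster more cleanly. On a dB scale the same operation amounts to subtracting the mean dB value.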

Although the frequency locations of spectral concentrations vary from one speaker to
another, the bursts of /t/ and /k/ exhibit some general spectral characteristics independent
of speaker and, to some extent, context. The /t/ bursts are, in general, rather broadband
and occupy primarily the high-frequency region of the spectrum, say, above 2000 Hz. The /k/
bursts show remarkably dissimilar shapes depending on the nature of the following vowel. The
burst of a /k/ preceding a front vowel has a more compact spectral shape than the /t/ burst,
and the location of the burst is in the mid-frequency region. The burst of a /k/ preceding a
back vowel has a predominant sharp peak in the low-frequency region. In addition, a secondary
spectral peak at high frequency can always be observed for the back /k/'s. This secondary
peak, presumably associated with the 3/4-wavelength resonance of the front cavity (Stevens),
has generally been ignored in describing the /k/ burst. Inclusion of this additional peak has
been shown to improve the quality of synthetic /k/'s (Klatt, personal communication).




Closer examination of Fig. 5-4 also reveals that the burst location for back /k/ is in one
of two frequency regions, depending on the vowel context. Similar results can be observed for
the /t/ bursts.

The average RMS amplitudes of the burst for /p,t,k,b,d,g/ are shown in Table 5-I. The data
in Table 5-I indicate that there is no significant difference in RMS amplitude between voiced
and voiceless stops. Dentals and velars have about the same RMS amplitude, whereas labial
bursts are weaker by about 12 dB.

In measuring the burst frequency for the labials, it was found that there is a wide range of
variation in the values found. Since the spectra of /p,b/ show no distinct burst frequency and
the RMS amplitudes of those stops are weak, we have decided not to present results on the burst
spectrum for labials.


Fig. 5-5. Distribution of the burst
frequency for /t/.


The distribution of burst frequency for /t/ is shown in Fig. 5-5. Averaging across all
vowels, the mean burst frequency was found to be 3650 Hz. As observed earlier, the distribution
is skewed toward the high-frequency region, with essentially all the measured values above
1000 Hz. The distribution appears to be bimodal. Closer examination of the data, in fact,
shows that the lower peak in the distribution is directly related to the underlying features of
the following vowel. Excluding the rounded and retroflexed vowels, the mean burst frequency
for /t/ was found to be 3900 Hz, whereas the mean burst frequency for /t/ preceding rounded
or retroflexed vowels was found to be approximately 3300 Hz. These two mean values
correspond well with the two peaks in the distribution.

The averaged burst frequencies for /t/ are plotted as a function of the following vowel, and
are shown in Fig. 5-6. It can be seen that burst frequencies for /t/ preceding rounded or
retroflexed vowels are consistently less than the overall mean value, whereas the burst
frequencies preceding all other vowels are always greater than the mean value.

The distribution of burst frequency for /d/ is shown in Fig. 5-7. Compared to the
distribution of /t/ bursts, the burst frequency distribution for /d/ appears to have shifted
down in frequency. Averaging across vowel context, the mean value was found to be about
3300 Hz. As is the case for /t/, the distribution appears to be bimodal in nature. The mean
burst frequency for /d/ preceding rounded or retroflexed vowels was found to be 2950 Hz,
while the mean value preceding all other vowels was found to be 3530 Hz.
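The three /d/ means quoted above allow a rough consistency check. Assuming (this is an assumption, not something the report states) that the overall mean is a token-weighted average of the two vowel groups, the share of tokens in the higher-frequency group follows directly. The function name is hypothetical, and since the overall mean is quoted only as "about 3300 Hz", the result is approximate.

```python
def fraction_in_upper_group(overall_mean, lower_mean, upper_mean):
    """Share w of tokens in the upper group, from overall = w*upper + (1 - w)*lower."""
    return (overall_mean - lower_mean) / (upper_mean - lower_mean)


# /d/ burst-frequency means in Hz, as read from the text above.
share = fraction_in_upper_group(3300.0, 2950.0, 3530.0)
print(round(share, 2))  # 0.6
```

Under that assumption, roughly 60 percent of the /d/ tokens would precede vowels that are neither rounded nor retroflexed.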

In Fig. 5-8 the /d/ burst frequency is plotted as a function of the following vowel. Again,
it can be seen that the averaged burst frequencies for /d/ preceding rounded and retroflexed
vowels are consistently less than the overall mean value, whereas the averaged burst frequencies
preceding all other vowels are consistently greater than the mean.




Fig. 5-8. Average burst frequency for singleton /d/ as a function
of vowel context.


The distribution of burst frequency for /k/ is shown in Fig. 5-9. The burst frequencies for
/k/ tend to be distributed in the low- to mid-frequency region, with less than 5 percent of the
values above 1000 Hz. Averaging across all vowels, the mean burst frequency was found to be
1910 Hz. However, there appear to be three distinct peaks in the distribution. Closer
examination of data reveals that these peaks are again attributable to the underlying features of the
following vowel. The mean burst frequency for /k/ preceding front vowels was found to be
2720 Hz. Preceding back and unrounded vowels, the mean burst frequency was found to be
1770 Hz. The mean burst frequency for /k/ preceding back and rounded vowels was found to
be 1150 Hz. These three values correspond well with the locations of the three peaks in the
distribution shown in Fig. 5-9.

The burst frequency for /k/ is plotted in Fig. 5-10 as a function of the following vowel.
Those values preceding front vowels are always greater than the mean. The frequencies of
bursts preceding back vowels are consistently less than the mean value, with those preceding
the rounded vowels having the smallest values.

The burst frequency distribution for /g/, as shown in Fig. 5-11, is very similar to that
of /k/. Averaged over all vowel contexts, the mean value is 1940 Hz. The mean values of
burst frequency for /g/ preceding front vowels, preceding back and unrounded vowels, and
preceding back and rounded vowels are 2720, 1770, and 1250 Hz, respectively. The average
burst frequencies for /g/ are plotted in Fig. 5-12 as a function of the vowel context.
Comparison of Figs. 5-10 and -12 shows that the vowel dependency of the /g/ burst is the same
as that of the /k/ burst.


Fig. 5-9. Distribution of the burst
frequency for /k/.


Fig. 5-10. Average burst frequency for singleton /k/ as a function
of vowel context.

The average burst amplitudes for /t,k,d,g/ are summarized in Table 5-II. From Table 5-II,
we see that the average burst amplitude of the voiceless stops is consistently greater, by about
2 dB, than that of the voiced stops. The burst amplitude is greater for dentals than for velars.
The averaged burst amplitude for the dentals and the velars is about the same as that of the
following vowel.

5.2.2  Stops in Clusters

The average RMS amplitude of the burst for stops in clusters is compared with that for
singleton stops in Table 5-III.

The values in Table 5-III represent the difference between the RMS amplitude of the
singleton stops and that of stops in clusters. For the stop-sonorant clusters, there is a marked
decrease, by some 8 dB, in RMS amplitude for the velars, whereas there is very little change
in RMS amplitude for the labials. The average decrease in RMS amplitude is about 3 dB for the
dentals. For the s-stop clusters, both dentals and velars show a decrease in RMS amplitude,
with dentals having a larger change. The labials, however, show an increase in RMS value
by 2 dB.
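To give a feel for the size of these changes, the dB figures can be converted to linear amplitude ratios. This sketch assumes the usual amplitude convention of 20 log10 (an assumption; the report does not restate its dB definition), and the function name is hypothetical.

```python
def db_decrease_to_amplitude_ratio(decrease_db):
    """Cluster-to-singleton amplitude ratio for a given decrease in dB (20*log10 scale)."""
    return 10.0 ** (-decrease_db / 20.0)


velar_ratio = db_decrease_to_amplitude_ratio(8.0)    # velars in stop-sonorant clusters
dental_ratio = db_decrease_to_amplitude_ratio(3.0)   # dentals in stop-sonorant clusters
labial_ratio = db_decrease_to_amplitude_ratio(-2.0)  # labials in s-stop clusters (an increase)
print(round(velar_ratio, 2), round(dental_ratio, 2), round(labial_ratio, 2))  # 0.4 0.71 1.26
```

An 8-dB decrease thus corresponds to the burst RMS falling to about 40 percent of its singleton value, while a negative "decrease" denotes an increase.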

Figures 5-13 and -14 plot the average burst frequencies for /t/ in /tr/ clusters and /k/ in
/kw/ clusters, respectively, as a function of the vowel context. Also plotted in these figures
are the corresponding values for the singleton stops. Figures 5-13 and -14 serve to illustrate
the fact that, while there is a marked change in burst frequency from singleton stops to stops
in clusters, this change is more related to the nature of the sonorant involved than to the
vowel context. Since this phenomenon was observed consistently for all the clusters, we shall
pool the results across vowel context for the stop-sonorant clusters.


TABLE 5-III

DECREASE IN OVERALL RMS AMPLITUDE (dB)
FROM SINGLETON TO CLUSTERS

Stop    Stop-Sonorant Clusters    /s/-Stop Clusters
/p/            -0.2                     -2.0
/t/             2.3                      5.0
/k/             8.2                      3.0
/b/            -0.6
/d/             4.0
/g/             8.7



Fig. 5-15. Average burst frequency
for /t,d,k,g/ in stop-sonorant clusters.


The average burst frequencies for stop-sonorant clusters are shown in Fig. 5-15, along
with values for singleton stops. The average burst frequency for /t/ in /tr/ clusters was found
to be 2480 Hz, a decrease of more than 1000 Hz. The burst frequency for /t/ in /tw/ clusters
is 2570 Hz, slightly greater than the /t/ in /tr/ clusters. As is the case for singleton stops,
the burst frequencies for /d/ in clusters were found to be some 300 Hz less than their voiceless
counterparts: 2120 Hz for /d/ in /dr/ clusters and 2210 Hz for /d/ in /dw/ clusters, respectively.

The burst frequencies for the velars in stop-sonorant clusters show no difference across
the voicing distinction. Preceding /l/, the burst frequency was found to be 1320 Hz for /k/ and
1230 Hz for /g/, respectively. Preceding /r/, the values are 1240 Hz for /k/ and 1190 Hz
for /g/, respectively. The burst frequencies were found to be the lowest when preceding /w/:
1050 Hz for /k/ and 980 Hz for /g/, respectively.

Averaged across all vowels, the mean burst frequency for /t/ in /st/ clusters was found
to be 3240 Hz. This value is closer to the mean value of singleton /d/ than that of /t/. As
shown in Fig. 5-16, the burst frequencies for /t/ in /st/ clusters vary with vowel context
in much the same way as singleton /d/.




Fig. 5-16. Average burst frequency for /t/ in /st/ clusters as a function
of vowel context.




Fig. 5-17. Average burst frequency for /k/ in /sk/ clusters as a function
of vowel context.


TABLE 5-IV

INCREASE IN RELATIVE BURST AMPLITUDE
FOR STOPS FROM SINGLETON TO CLUSTERS

Stop    Stop-Sonorant Clusters    /s/-Stop Clusters


The burst frequency for /k/ in /sk/ clusters is plotted in Fig. 5-17 as a function of the
following vowel, along with that of singleton /g/. Averaged across all vowels, the mean burst
frequency was found to be 2010 Hz, about the same as singleton /g/.

Table 5-IV compares the averaged burst amplitude for stops in clusters with that of
singleton stops. For the stop-sonorant clusters, there is a decrease in the average burst amplitude
of about 4 dB for the velars, whereas the average burst amplitude for the dentals increases
slightly. The averaged burst amplitudes for /t/ and /k/ in s-stop clusters decrease by 2 to 3 dB,
resulting in values approximately the same as the singleton /d/ and /g/, respectively.

5.3  DISCUSSION

Most of the results presented in the previous section concern only the dental and velar stops.
We have excluded results on the labials because their releases are, in general, very weak. The
average RMS amplitudes for the labials were some 12 dB less than the dentals and the velars.
This weak release makes it extremely difficult to locate the burst frequency. The fact that the
labials lack a clear, distinct burst frequency raises the question of whether the burst frequency
is a perceptually important cue in the identification of the labials. Although the presence of a
burst has been demonstrated to contribute to the perception of /p,b/, it may be postulated that
the spectral concentration of the burst perhaps is not as important as the fact that its relative
amplitude is much weaker than that of a stop with another place of articulation.

The burst frequency for the dentals was found to have a distribution that is bimodal in
nature. Bursts for /t,d/ preceding rounded vowels were found to be 600 Hz lower than those
preceding all other vowels. This lowering of burst frequencies is presumably the consequence
of anticipatory rounding of the lips during the stop release. Rounding reduces the opening of
the vocal tract, and the protruding lips also increase the length of the front cavity. Both of
these factors have the effect of lowering the resonant frequency of the cavity. It is also
possible that the measured burst frequencies for the rounded and the unrounded dentals actually
correspond to the natural frequencies of different cavities entirely. With no rounding, the burst
frequency may be a consequence of the constriction itself, rather than the front cavity.
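The effect of front-cavity length can be illustrated with a deliberately idealized model: a uniform tube closed at the constriction and open at the lips, whose lowest resonance is the quarter-wavelength frequency c/(4L). The speed of sound (35400 cm/sec), the cavity lengths, and the function name are assumptions chosen for illustration, not measurements from this study, and the model ignores the narrowing of the lip opening.

```python
SPEED_OF_SOUND_CM_PER_SEC = 35400.0  # assumed value for warm, moist air


def quarter_wave_resonance_hz(cavity_length_cm):
    """Lowest resonance of a uniform tube closed at one end and open at the other."""
    return SPEED_OF_SOUND_CM_PER_SEC / (4.0 * cavity_length_cm)


# Hypothetical front-cavity lengths without and with lip protrusion.
unrounded_hz = quarter_wave_resonance_hz(2.3)
rounded_hz = quarter_wave_resonance_hz(2.7)
print(round(unrounded_hz), round(rounded_hz))  # 3848 3278
```

In this toy model a lengthening of only a few millimeters lowers the resonance by several hundred Hz, the same order as the shift reported above.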

The distribution of burst frequency for the velars has three distinct peaks. The average
burst frequencies for /k,g/ preceding front and back vowels differ by 1200 Hz. For the back
vowels, the bursts for /k,g/ are 550 Hz lower preceding rounded vowels than preceding
unrounded vowels. The front and back velar distinction is quite well-known in the phonetics
literature (for example, Heffner35). X-ray studies of articulatory movements (for example,
Perkell31) have also shown that the velars preceding front and back vowels have different places
of articulation. As shown in Fig. 5-4, this articulatory difference is transformed into acoustic
characteristics that are quite dissimilar for the front and back /k,g/'s. Although the places of
articulation might change from front to back along a continuum, the measured burst frequencies
clearly show only two or three discrete values. Our measurement of burst frequency,
therefore, lends further support to the theory of the quantal nature of speech production (Stevens36).
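As a rough cross-check of these differences, the /g/ means reported in Sec. 5.2.1 (2720, 1770, and 1250 Hz) can be compared directly. The equal weighting of the two back-vowel classes is an assumption; the report does not say how the front/back difference was pooled.

```python
# Mean /g/ burst frequencies (Hz) reported in Sec. 5.2.1.
g_front, g_back_unrounded, g_back_rounded = 2720.0, 1770.0, 1250.0

# Assumption: the two back-vowel classes are weighted equally.
g_back = (g_back_unrounded + g_back_rounded) / 2.0

front_minus_back = g_front - g_back
unrounded_minus_rounded = g_back_unrounded - g_back_rounded
print(front_minus_back, unrounded_minus_rounded)  # 1210.0 520.0
```

Both differences are close to the 1200-Hz and 550-Hz figures quoted for /k,g/ together.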

Due to the differences in the sizes and shapes of the vocal tract, the burst frequency, in
general, does vary from one speaker to another. However, we have found that the distribution
for a particular stop maintains a certain characteristic independent of speaker variation. The
effect of pooling the results across speakers and sessions only tends to be a broadening of the
skirts of the distribution. Figure 5-18 shows the distribution of /t/ for a single speaker.
Comparing Figs. 5-5 and -18, one can clearly observe the same characteristics as discussed
earlier. Speaker KNS, incidentally, has a tendency to centralize the vowel /a/ and diphthongs
/ay, aw/ due to his dialect background (McDavid37). This phenomenon accounts for the higher
values of burst frequency for the velars preceding /a, ay, aw/.


Fig. 5-18. Distribution of the burst frequency
for /t/ for a single speaker KNS (75 samples).

We have found essentially no difference in burst frequency values for the voiceless and
voiced velars. There is, however, a consistent tendency for the voiced dental to have a burst
frequency that is 200 to 300 Hz less than its voiceless counterpart under the same phonetic
environment. In fact, the same trend can also be observed in clusters. In addition, burst
frequency for the voiceless unaspirated /t/ was found to have values that correspond closer to /d/
than to the aspirated /t/. It is possible that /t/ and /d/ articulations actually involve different
tongue positions, thus resulting in the burst frequency shift. It is also possible that the shift
is due to the different positions of the larynx for the voiced and voiceless stops. Changing the
length of the back cavity, assuming finite coupling, can shift the burst frequency. Another
possible explanation lies in the differences in the source spectra between the voiced and voiceless
stops. The frequency location of the peak in the source spectrum can shift as a result of the
differences in volume flow between the voiced and voiceless stops (Stevens, personal
communication). The voiceless stops, with a higher volume flow across the constriction, will have a
source spectrum that peaks at higher frequency.

Our results indicate that the average burst amplitude for /t,k,d,g/ is approximately the
same as that of the following vowel in the same frequency region. In a recently reported
perceptual experiment, Bush, Stevens, and Blumstein38 also obtained a similar finding. The burst
amplitude was varied relative to the spectral peak of the following vowel in the same frequency
region, and the subjects were asked to judge the quality of the stops. Bush et al. reported that
the subjects consistently judged the best /d,g/ as having a burst amplitude comparable to, or
even less than, that of the following vowel. These results contradict the common notion that
burst amplitude should be greater. It is possible that the auditory system in doing the frequency
analysis utilizes a different window that might be more sensitive to onset or transient phenomena
(Stevens, personal communication), thus resulting in a greater burst amplitude. It is also
possible that the notion of greater burst amplitude was concluded from observations made on
spectrograms, where the AGC circuitry tends to boost the burst intensity.

Results in Table 5-II also show that the burst amplitude is on the order of 2 dB higher for
the voiceless stops. Stevens39 has shown that in the generation of turbulence noise, the
radiated sound pressure of the noise is proportional to the three-halves power of the pressure drop
across the constriction, all else being equal. The measured 2-dB difference in burst amplitude,
according to the above relationship, corresponds to a pressure drop ratio of approximately 1.15.
This calculated ratio in pressure drop is in good agreement with the measured pressure drops
reported by several investigators (Lubker and Parris,40 Lisker41).
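The arithmetic behind this ratio can be reproduced in a few lines. The sketch assumes that the 2-dB figure refers to sound pressure amplitude (20 log10) and inverts the three-halves power law stated above; the function name is hypothetical.

```python
def pressure_drop_ratio(burst_difference_db, exponent=1.5):
    """Pressure-drop ratio implied by a burst level difference, for p_sound ~ dP**exponent."""
    sound_pressure_ratio = 10.0 ** (burst_difference_db / 20.0)
    return sound_pressure_ratio ** (1.0 / exponent)


ratio = pressure_drop_ratio(2.0)
print(round(ratio, 2))  # 1.17
```

Under these assumptions the ratio comes out near 1.17, in the neighborhood of the value of approximately 1.15 quoted in the text; the small gap is within what rounding of the 2-dB figure would produce.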




Based on the measurements of the burst durations of the voiced and voiceless stops, Klatt
has estimated that the difference in duration alone can make the burst of a voiceless stop be
perceived as at least 4 dB louder. Our measured difference in burst amplitudes will further
contribute to the difference in perceived loudness between voiced and voiceless stops.

Results on the RMS amplitude of stops in clusters indicate a marked decrease in the values
for velars. This could be a direct consequence of the fact that /kl, kr, kw/ all have a secondary
constriction in front of the noise source, thus reducing the pressure drop across the primary
constriction and the resulting radiated sound pressure.

When stops appear in stop-sonorant clusters, the burst frequency is dependent on the nature
of the following sonorant, with the following vowel having little or no influence. The lowering
of burst frequency for stops in clusters can be accounted for by considering the coarticulatory
effect of the following sonorant (Fant42). Rounding and retroflexing cause about the same amount
of lowering for the dental burst, whereas rounding has the predominant effect of lowering the
burst frequency for velars. It is of interest to note that the average burst frequency for /t/ in
/tr/ clusters was found to be approximately 2500 Hz. This measured value is slightly less than the
most preferred burst frequency for retroflexed /t/'s found in a perceptual experiment recently
reported by Stevens and Blumstein.43 This difference is possibly due to the fact that the position
of the constriction for retroflexed /t/ is more anterior than that of /t/ in /tr/ clusters.




CHAPTER  6 

CONCLUDING REMARKS

This report has two distinct and integral parts. The first part is directed toward the
development of a general facility where controlled studies of the acoustic characteristics of
selected consonants, consonant clusters, and vowels in a prescribed phonetic environment can
be carried out. The type of data we have collected, and the process through which they were
collected, have been described in Chap. 2. Various aspects of the analysis system and the
data-base facility have been described in Chap. 3.

The second half of the report utilizes the collected data and the developed facility to study
the acoustic characteristics of the English stops. The temporal characteristics of these stops
were presented in Chap. 4 and the spectral characteristics in Chap. 5.

We approached the report with the belief that in order to study the acoustic characteristics
of speech sounds and provide a strengthened basis for linguistic and phonetic theories, one must
examine a large body of data, under a controlled environment, and with the help of an interactive
computer facility. Our experience and the results presented in this report clearly substantiate
our claim. Although the data-base facility is no longer available, due to the termination of
service of the TX-2 computer, this report hopefully has so demonstrated the importance and
necessity of such a facility that similar ones will be developed on other computer systems.

The results we found on the temporal characteristics of the English stops are, with minor
exceptions, in good agreement with those recently reported by Klatt8 on a substantially smaller
data base. However, the results on the spectral characteristics represent, to our knowledge,
a first attempt to quantify the release, both in frequency and in amplitude. These data will
hopefully be valuable for immediate applications such as speech synthesis and speech recognition.
They also will serve to help us gain a better understanding of the production and perception of
speech.

The question of the presence of acoustic invariance of phonetic features has constantly been
raised over the past two decades and the answer has thus far eluded us. Even under a controlled
environment, such as the one in this report, we found that a very complex set of relationships
must operate and mediate among the various acoustic realizations of a given feature. The
feature voicing, for example, manifests itself in a difference in fundamental frequency, VOT, burst
amplitude, and a multitude of others. All things being equal, each of these dimensions might be
sufficient in making the voiced-voiceless distinction. When the condition is less ideal, however,
the interactions among features will have to be considered. For example, some decision on the
place of articulation may have to be made before VOT can be used to distinguish voiced and
voiceless stops.

In the course of this research, we have been able to isolate and quantify certain
acoustic attributes of the underlying features, such as voicing and place of articulation. Although
the effect of context and the interaction with other features are often observed, we nevertheless
found many aspects of the results to be context independent. While the exact nature of the
interaction among features and, consequently, their acoustic correlates are not yet understood, we
feel that a necessary step has been taken toward a better understanding of the problem.


REFERENCES 


1.  R. Jakobson, G. Fant, and M. Halle, Preliminaries to Speech Analysis
(M.I.T. Press, Cambridge, 1951).

2.  F. S. Cooper, P. C. Delattre, A. M. Liberman, J. M. Borst, and L. J.
Gerstman, "Some Experiments on the Perception of Synthetic Speech
Sounds," J. Acoust. Soc. Am. 24, 597-606 (1952).

3.  P. C. Delattre, A. M. Liberman, and F. S. Cooper, "Acoustic Loci and
Transitional Cues for Consonants," J. Acoust. Soc. Am. 27, 769-775 (1954).

4.  E. Fischer-Jorgensen, "Acoustic Analysis of Stop Consonants," Miscellanea
Phonetica II, 42-59 (1954).

5.  M. Halle, G. W. Hughes, and J-P. A. Radley, "Acoustic Properties of Stop
Consonants," J. Acoust. Soc. Am. 29, 107-116 (1957).

6.  L. Lisker and A. S. Abramson, "A Cross-Language Study of Voicing in
Initial Stops: Acoustic Measurements," Word 20, No. 3, 384-422
(December 1964).

7.  L. Lisker and A. S. Abramson, "Some Effects of Context on Voice Onset
Time in English Stops," Lang. and Speech 10, 1-28 (1967).

8.  D. H. Klatt, "Voice Onset Time, Frication, and Aspiration in
Word-initial Consonant Clusters," J. Speech Hear. Res. 18, No. 4, 686-706
(December 1975).

9.  K. N. Stevens and M. Klatt, "Study of Acoustic Properties of Speech
Sounds," Bolt, Beranek and Newman Report No. 1669 (1968).

10.  K. N. Stevens, "Study of Acoustic Properties of Speech Sounds II, and
Some Remarks on the Use of Acoustic Data in Schemes for Machine
Recognition of Speech," Bolt, Beranek and Newman Report No. 1871 (1969).

11.  A. S. House and G. Fairbanks, "The Influence of Consonant Environment
upon the Secondary Acoustic Characteristics of Vowels," J. Acoust. Soc.
Am. 25, No. 1, 105-113 (January 1953).

12.  K. N. Stevens and A. S. House, "Perturbation of Vowel Articulations by
Consonant Context: An Acoustic Study," J. Speech Hear. Res. 6, No. 2,
111-128 (1963).

13.  K. N. Stevens, A. S. House, and A. P. Paul, "Acoustic Description of
Syllable Nuclei: An Interpretation in Terms of a Dynamic Model of
Articulation," J. Acoust. Soc. Am. 40, No. 1, 123-132 (July 1966).

14.  B. S. Atal and S. L. Hanauer, "Speech Analysis and Synthesis by Linear
Prediction of the Speech Wave," J. Acoust. Soc. Am. 50, 637-665
(August 1971).

15.  J. D. Markel, Formant Trajectory Estimation from a Linear Least
Squares Inverse Filter Formulation, Speech Communications Research
Laboratory, Inc. Monograph No. 7 (SCRL, Inc., Santa Barbara,
California, 1971).

16.  J. I. Makhoul and J. J. Wolf, "Linear Prediction and the Spectrum
Analysis of Speech," Bolt, Beranek and Newman Report No. 2304 (1972).

17.  V. N. Faddeeva, Computational Methods of Linear Algebra (Dover
Publications, New York, 1959).

18.  N. Levinson, Appendix B of Extrapolation and Smoothing of Stationary
Time Series, by N. Wiener (M.I.T. Press, Cambridge, 1949).

19.  B. S. Atal, "Sound Transmission in the Vocal Tract with Applications
to Speech Analysis and Synthesis," Proc. 7th Int. Congress Acoust.,
Budapest, Hungary, August 1971.

20.  V. W. Zue, "Speech Analysis by Linear Prediction," Quarterly Progress
Report No. 105, Research Laboratory of Electronics, M.I.T. (April 1972),
pp. 133-142.




21.  M. R. Portnoff, V. W. Zue, and A. V. Oppenheim, "Some Considerations
in the Use of Linear Prediction for Speech Analysis," Quarterly Progress
Report No. 106, Research Laboratory of Electronics, M.I.T. (July 1972),
pp. 141-150.

22.  F. Itakura and S. Saito, "Analysis Synthesis Telephony Based on the
Maximum Likelihood Method," Report of the 6th Int. Congress Acoust.,
Tokyo, Japan (1968), Vol. II, Paper C-5-5.

23.  A. V. Oppenheim and R. W. Schafer, "Homomorphic Analysis of Speech,"
IEEE Trans. Audio and Electroacoust. AU-16, No. 2, 221-226 (June 1968).

24.  B. S. Atal and M. R. Schroeder, "Recent Advances in Predictive Coding
Application to Speech Synthesis," Preprints of the Speech Communication
Seminar, Stockholm, Sweden, August 1974, Vol. 1, pp. 27-31.

25.  A. Stowe, "SPC on TX-2," Internal Memorandum, Lincoln Laboratory,
M.I.T. (1972).

26.  A. S. House, "On Vowel Duration in English," J. Acoust. Soc. Am. 33,
1174-1178 (1961).

27.  G. Peterson and I. Lehiste, "Duration of Syllable Nuclei in English,"
J. Acoust. Soc. Am. 32, 693-703 (1960).

28.  G. Fant, K. Ishizaka, J. Lindqvist, and J. Sundberg, "Subglottal
Formants," Quarterly Progress and Status Report, Speech Transmission
Laboratory, Royal Institute of Technology, Stockholm, Sweden
(April 1972), pp. 1-15.

29. K. N. Stevens and D. H. Klatt, "The Role of Formant Transitions in the
Voiced-Voiceless Distinction for Stops," J. Acoust. Soc. Am. 55,
653-659 (1974).

30. R. A. Houde, "A Study of Tongue Body Motion during Selected Speech
Sounds," Doctoral dissertation, University of Michigan, 1967.

31. J. S. Perkell, "Physiology of Speech Production: Results and
Implications of a Quantitative Cineradiographic Study," Research Monograph
No. 53 (M.I.T. Press, Cambridge, 1969).

32.  K.  N.  Stevens,  "Modes  of  Conversion  of  Airflow  to  Sound,  and  Their 
Utilization  in  Speech,"  Paper  presented  at  the  8th  Int.  Congress  of 
Phonetic  Sciences,  Leeds,  England,  August  1975. 

33. M. Halle and K. N. Stevens, "A Note on Laryngeal Features," Quarterly
Progress Report No. 101, Research Laboratory of Electronics, M.I.T.
(April 1971), pp. 198-213.

34. K. N. Stevens, "Further Theoretical and Experimental Bases for the
Quantal Places of Articulation," Quarterly Progress Report No. 108,
Research Laboratory of Electronics, M.I.T. (1973), pp. 247-252.

35.  R-M.S.  Heffner,  General  Phonetics  (University  of  Wisconsin  Press, 
Madison,  1950). 

36. K. N. Stevens, "The Quantal Nature of Speech: Evidence from
Articulatory-acoustic Data," in Human Communications: A Unified
View, E. E. David, Jr. and P. B. Denes, Eds. (McGraw-Hill, New
York, 1972).

37. R. I. McDavid, Jr., "Some Social Differences in Pronunciation," in
Readings in Applied English Linguistics, H. B. Allen, Ed. (Appleton-
Century-Crofts, New York, 2nd Edition, 1967).

38. M. A. Bush, K. N. Stevens, and S. E. Blumstein, "Burst Intensity and
Position in the Perception of Voiced Stop Consonants," J. Acoust. Soc.
Am. 52, S40 (1976).

39. K. N. Stevens, "Airflow and Turbulence Noise for Fricative and Stop
Consonants: Static Considerations," J. Acoust. Soc. Am. 50,
1180-1192 (1971).




40. J. F. Lubker and P. J. Parris, "Simultaneous Measurements of Intraoral
Pressure, Force of Labial Contact and Labial Electromyographic Activity
during the Production of Stop Cognates /p/ and /b/," J. Acoust. Soc.
Am. 47, 625-633 (1970).

41. L. Lisker, "Supraglottal Air Pressure in the Production of English
Stops," Lang. and Speech 13, 215-230 (1970).

42.  G.  Fant,  Speech  Sounds  and  Features  (M.I.T.  Press,  Cambridge,  1973). 

43. K. N. Stevens and S. E. Blumstein, "Quantal Aspects of Consonant
Production and Perception: A Study of Retroflex Stop Consonants,"
J. Phonetics 3, 215-233 (1975).




UNCLASSIFIED 


SECURITY CLASSIFICATION OF THIS PAGE (When Data Entered)


REPORT DOCUMENTATION PAGE
READ INSTRUCTIONS BEFORE COMPLETING FORM


1. REPORT NUMBER
ESD-TR-76-112

2. GOVT ACCESSION NO.

3. RECIPIENT'S CATALOG NUMBER

4. TITLE (and Subtitle)
Acoustic Characteristics of Stop Consonants:
A Controlled Study

5. TYPE OF REPORT & PERIOD COVERED
Technical Report

6. PERFORMING ORG. REPORT NUMBER
Technical Report 523

7. AUTHOR(s)
Victor W. Zue, Consultant

8. CONTRACT OR GRANT NUMBER(s)
F19628-76-C-0002

9. PERFORMING ORGANIZATION NAME AND ADDRESS
Lincoln Laboratory, M.I.T.
P.O. Box 73
Lexington, MA 02173

10. PROGRAM ELEMENT, PROJECT, TASK AREA & WORK UNIT NUMBERS
ARPA Order 2006
Program Element No. 62706E
Project No. 6P10

11. CONTROLLING OFFICE NAME AND ADDRESS
Defense Advanced Research Projects Agency
1400 Wilson Boulevard
Arlington, VA 22209

12. REPORT DATE
17 May 1976

13. NUMBER OF PAGES
72

14. MONITORING AGENCY NAME & ADDRESS (if different from Controlling Office)
Electronic Systems Division
Hanscom AFB
Bedford, MA 01731

15. SECURITY CLASS. (of this report)
Unclassified

15a. DECLASSIFICATION DOWNGRADING SCHEDULE

16. DISTRIBUTION STATEMENT (of this Report)
Approved for public release; distribution unlimited.

17. DISTRIBUTION STATEMENT (of the abstract entered in Block 20, if different from Report)

18. SUPPLEMENTARY NOTES
None

19. KEY WORDS (Continue on reverse side if necessary and identify by block number)
acoustic phonetics
stop consonants
data-base
speech analysis
digital signal processing
voice-onset time
burst frequency
burst amplitude
acoustic invariance
speech communication

20. ABSTRACT (Continue on reverse side if necessary and identify by block number)

The research described in this report has two distinct and integral parts. The first part is directed toward the development of a highly
interactive computer facility where controlled studies of the acoustic characteristics of selected consonants, consonant clusters, and vowels
in a prescribed phonetic environment can be carried out. In conjunction with the development of the data-base facility, a large corpus of
acoustic data has been collected. The format of the data is a nonsense hə'CVC utterance embedded in a carrier sentence "Say ___ again,"
where the consonants and vowels are systematically varied. Fifteen vowels and diphthongs were used to form the syllable nuclei and
51 word-initial consonants and consonant clusters were included.

The second half of the report utilizes the collected data and the developed facility to study the acoustic characteristics of English stops,
both in singletons and in clusters. The data included 1728 utterances spoken by 3 male speakers. Various aspects of the temporal and
spectral characteristics of these stops were quantified and discussed in detail. The findings in general suggest the presence of
context-independent acoustic properties for these stops. The exact nature of the acoustic invariance, however, still remains a topic of further investigation.


DD FORM 1473 (1 JAN 73)  EDITION OF 1 NOV 65 IS OBSOLETE


UNCLASSIFIED


SECURITY CLASSIFICATION OF THIS PAGE (When Data Entered)


